This document discusses the evolution of Netflix's API architecture over time. It begins by explaining Netflix's initial "one-size-fits-all" REST API model and how requests grew exponentially as more devices were supported. This led Netflix to move away from resource-based APIs to experience-based APIs tailored for specific devices and use cases. Key aspects of the new design included reducing "chattiness", handling variability across devices, and supporting innovation at different rates. The document also discusses Netflix's approach to versioning APIs and making them more resilient through techniques like circuit breakers and fallbacks.
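The circuit breakers and fallbacks mentioned above can be sketched in a few lines. This is an illustrative stand-in, not Netflix's actual Hystrix implementation; all names (`CircuitBreaker`, `fetch_recommendations`, `default_recommendations`) are hypothetical, for illustration only.

```python
# Minimal sketch of the circuit-breaker-with-fallback pattern: after a
# threshold of consecutive failures, calls short-circuit straight to a
# fallback instead of hitting the failing dependency.

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, command, fallback):
        if self.open:
            return fallback()          # short-circuit: skip the dependency
        try:
            result = command()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            return fallback()          # degrade gracefully, don't propagate


def fetch_recommendations():
    raise TimeoutError("backend is unavailable")

def default_recommendations():
    return ["popular-title-1", "popular-title-2"]   # static fallback data

breaker = CircuitBreaker(threshold=3)
for _ in range(5):
    titles = breaker.call(fetch_recommendations, default_recommendations)
print(breaker.open)   # True: after 3 failures the circuit is open
print(titles)
```

The key property the talks emphasize is that a failing backend degrades the experience (static recommendations) rather than taking down the caller.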
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
Revolutions have a common pattern in technology and this is no different for the API space. This presentation discusses that pattern and goes through various API revolutions. It also uses Netflix as an example of how some revolutions evolved and where things may be headed.
Maintaining the Netflix Front Door - Presentation at Intuit Meetup (Daniel Jacobson)
This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. It also shares several of our open source efforts such as Zuul, Scryer, Hystrix, RxJava and the Simian Army.
Techniques for Scaling the Netflix API - QCon SF (Daniel Jacobson)
This presentation was from QCon SF 2011. In these slides I discuss various techniques that we use to scale the API. I also discuss in more detail our effort around redesigning the API.
Set Your Content Free!: Case Studies from Netflix and NPR (Daniel Jacobson)
Last Friday (February 8th), I spoke at the Intelligent Content Conference 2013. When Scott Abel (aka The Content Wrangler) first contacted me to speak at the event, he asked me to speak about my content management and distribution experiences from both NPR and Netflix. The two experiences seemed to him to be an interesting blend for the conference. These are the slides from that presentation.
I have applied comments to every slide in this presentation to include the context that I otherwise provided verbally during the talk.
Scaling the Netflix API - From Atlassian Dev Den (Daniel Jacobson)
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
APIs for Internal Audiences - Netflix - App Dev Conference (Daniel Jacobson)
API programs, typically thought of as public programs to see what outside developer communities can build with a company's data, are becoming more and more critical to the success of mobile and device strategies. This presentation takes a look at the Netflix and NPR strategies that led to tremendous growth and discusses how Netflix plans to take this internal API strategy to the next level.
This is a presentation that I gave to ESPN's Digital Media team about the trajectory of the Netflix API. I also discussed Netflix's device implementation strategy and how it enables rapid development and robust A/B testing.
This is my presentation from the Business of APIs Conference in SF, held by Mashery (http://www.apiconference.com).
This talk briefly covers the history of the Netflix API, then goes into three main categories of scaling:
1. Using the cloud to scale in size and internationally
2. Using Webkit to scale application development in parallel to the flexibility afforded by the API
3. Redesigning the API to improve performance and to downscale the infrastructure as the system scales
When viewing these slides, please note that they are almost entirely image-based, so I have added notes for each slide to detail the talking points.
Most API providers focus on solving all three of the key challenges for APIs: data gathering, data formatting and data delivery. All three of these functions are critical for the success of an API; however, not all of them should be solved by the API provider. Rather, the API consumers have a strong, vested interest in the formatting and delivery. As a result, API design should be addressed based on the true separation of concerns between the needs of the API provider and the various API consumers.
This presentation goes into the separation of concerns. It also goes into depth in how Netflix has solved for this problem through a very different approach to API design.
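The separation of concerns described above can be sketched as follows: the provider owns data gathering once, while each device team owns the formatting and delivery for its own experience. This is a hedged illustration of the idea, not Netflix's actual code; all function and field names are hypothetical.

```python
# Provider side: generic data gathering, implemented once.
def get_title(title_id):
    return {"id": title_id, "name": "Example Show",
            "synopsis": "A long synopsis ...",
            "boxart": {"small": "s.jpg", "large": "l.jpg"}}

# Consumer side: each device team shapes its own response.
def tv_home_row(title_id):
    t = get_title(title_id)
    # The TV UI wants large artwork and no synopsis text.
    return {"name": t["name"], "boxart": t["boxart"]["large"]}

def mobile_home_row(title_id):
    t = get_title(title_id)
    # The phone UI wants small artwork and a truncated synopsis.
    return {"name": t["name"], "boxart": t["boxart"]["small"],
            "synopsis": t["synopsis"][:20]}

print(tv_home_row(1))
print(mobile_home_row(1))
```

The point of the pattern is that adding a new device type means adding a new consumer-owned formatter, not changing the shared data-gathering layer.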
This presentation was given at the following API Meetup in SF:
http://www.meetup.com/API-Meetup/events/171255242/
I gave this presentation to the engineering team at PayPal. This presentation discusses the history and future of the Netflix API. It also goes into API design principles as well as concepts behind system scalability and resiliency.
This presentation demonstrates the great successes of the Netflix API to date. After some introspection, however, there is an opportunity to better prepare the API for the future. This presentation also offers a few ideas on how the Netflix API architecture may change over time.
A talk about the Netflix API and how it serves as the front door for Netflix device UIs. Topics include: API design, resiliency patterns, scalability, and enabling fast dev/deploy cycles.
History and Future of the Netflix API - Mashery Evolution of Distribution (Daniel Jacobson)
Presentation on the history and future of the Netflix API. This presentation walks through how the API was formed, why it needs a redesign and some of the principles that will be applied in the redesign effort.
This presentation was given at the Mashery Evolution of Distribution session in San Francisco on June 2, 2011.
Many API programs get launched without a clear understanding as to WHY the API should exist. Rather, many are focused on WHAT the API consists of and HOW it should be targeted, implemented and leveraged. This presentation focuses on establishing the need for a clear WHY proposition behind the decision. The HOW and then WHAT will follow from that.
This presentation also uses the history of the Netflix API to demonstrate the power, utility and importance of knowing WHY you are building an API.
What is it that turns an ordinary API into a great API? This talk from OSCON 2012 outlines the 5 "keys" to having a great API. Lots of examples from successful real-world APIs are used to highlight what matters. Also, this talk reveals 7 lesser known but very important "API secrets".
Migrating Automation Tests to Postman Monitors and ROI (Postman)
In this presentation I will be covering BetterCloud's approach to creating and finding patterns for complex authentication and identifying key performance and correctness metrics for ensuring API stability. We'll take a deep dive into BetterCloud's cost-benefit analysis for migrating automated tests to Postman monitors and go over the challenges and successes faced along the way.
NPR: Digital Distribution Strategy: OSCON 2010 (Daniel Jacobson)
When launching the API at OSCON in 2008, NPR targeted four audiences: the open source community; NPR member stations; NPR partners and vendors; and finally our internal developers and product managers. In its short two-year life, the NPR API has grown tremendously, from only a few hundred thousand requests per month to more than 60M. The API, furthermore, has enabled tremendous growth for NPR in the mobile space while facilitating more than 100% growth in total page views in the last year.
KPIs for APIs (and how API Calls are the new Web Hits, and you may be measuri... (John Musser)
How do you measure API success? What KPIs do APIs need? What mistakes should I avoid? Find out what you should, and shouldn't, be measuring as part of your API program in this Business of APIs Conference NYC talk. Dive into a breadth of API metrics, the 6 keys to better API metrics, and the traps to beware of (the important do's and don'ts). Also real-world API case studies show who measures what.
Postman Galaxy Tour: San Francisco - Workshop Presentation (Postman)
Postman's Developer Advocate Lead, Joyce Lin, presented a 3-hour workshop walking through tons of Postman features, including:
Collections
Environments
Workspaces
Variables
Scripts
Tests
And more advanced features
At Netflix, we provide an API that supports the content discovery, sign-up, and playback experience on thousands of device types that millions use around the world every day. As our user base and traffic have grown by leaps and bounds, we are continuously evolving this API to be flexible, scalable, and resilient and enable the best experience for our users. In this talk, I gave an overview of how and why the Netflix API has evolved to where it is today and how we make it resilient against failures while keeping it flexible and nimble enough to support continuous A/B testing.
At Netflix, we provide a Java-based API that supports the content discovery, sign-up, and playback experience on thousands of device types that millions use around the world every day. As our user base and traffic have grown by leaps and bounds, we are continuously evolving this API to enable the best user experience. In this talk, I will give an overview of how and why the Netflix API has evolved to where it is today and where we plan to take it in the future. I will discuss how we make our system resilient against failures using tools such as Hystrix and FIT, while keeping it flexible and nimble enough to support continuous A/B testing.
Being a female engineering leader means dealing with a host of interesting challenges, some good, some bad, and some ugly. I share experiences of female engineering leaders and provide a picture about what our daily life looks like. One of my goals is to give the “inside story” to men so they can better understand and provide the right kind of mentorship. Another goal is to give women with leadership ambitions a better understanding of the job.
I cover some of the bad news—how small the percentage of women leaders is, how difficult it is to hire women leaders, and how many women shy away from leadership positions in tech. I also touch on the ugly—the “war stories” of being a female leader, from thinly veiled innuendo to incredulity about our job titles (thankfully neither from colleagues)—before focusing on the good—why I and others have aspired to become engineering leaders and what we love about a job that allows us to build great technology, work with great people, and help people develop their careers. I emphasize the importance of both female and male mentors as well as the importance of working against conscious and unconscious bias, and I conclude by looking ahead to the future, offering some concrete lessons to take away.
The Netflix API Platform for Server-Side Scripting (Katharina Probst)
Presented at QCon NYC 2016.
The Netflix API is the front-door for almost all device/UI requests from 1000+ device types to the Netflix backends. It serves everything from movie and show recommendations, profile, sign-up, and A/B test related functionality, to bookmarks and licenses for playback.
Because all devices use this API, and because Netflix runs on devices of widely varying sizes and interaction models, it has served us well to enable a platform against which device teams write server-side scripts. Using Netflix as an example, the goal of this talk is to explore situations in which server-side scripting is a good solution for applications. I will describe our first approach, which uses Groovy scripts. I will detail how the scripts are uploaded and can make use of shared modules. This approach allows for high flexibility and performance as well as high developer velocity, at the expense of added risk of injecting scripts into running servers. I will then dive into a new approach that will isolate the scripts into their own containers without compromising the original goals and will allow teams to write scripts in node.js, which is more natural for them.
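The server-side-scripting idea above can be sketched as a registry of per-device adapter functions running inside the platform. This is only an illustrative model; the real platform uses uploaded Groovy scripts (and later isolated node.js containers), and every name here (`register_script`, `shared_catalog`, `handle_request`) is hypothetical.

```python
SCRIPTS = {}   # device type -> adapter function

def register_script(device_type):
    # Decorator that "uploads" a device team's script into the platform.
    def decorator(fn):
        SCRIPTS[device_type] = fn
        return fn
    return decorator

def shared_catalog():
    # Platform-owned data access, shared by every device script.
    return [{"id": 1, "name": "Example Show", "maturity": "PG"}]

@register_script("tv")
def tv_script(catalog):
    return [t["name"] for t in catalog]            # TV wants names only

@register_script("mobile")
def mobile_script(catalog):
    return [{"n": t["name"], "m": t["maturity"]}   # mobile wants compact keys
            for t in catalog]

def handle_request(device_type):
    # The platform routes each request to that device's script.
    return SCRIPTS[device_type](shared_catalog())

print(handle_request("tv"))       # ['Example Show']
```

Running each script in its own container, as the talk describes, keeps this routing model while removing the risk of injecting arbitrary code into shared API servers.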
Developing applications with a microservice architecture (SVforum, microservi... (Chris Richardson)
Here is the version of my microservices talk that I gave on September 17th at the SVforum Cloud SIG/Microservices meetup.
To learn more see http://microservices.io and http://plainoldobjects.com
Beyond DevOps - How Netflix Bridges the Gap (Josh Evans)
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. Simultaneously improving service quality and enabling rapid, continuous change seems impossible on the surface.
At Netflix, Operations Engineering is a centralized organization whose charter is to accomplish just that by applying high-leverage software engineering practices like continuous delivery, real-time analytics, and automation to solve operational problems. It's well established that many traditional IT Operations teams struggle to bridge the gap with software engineering. Operations Engineering is no exception. And while DevOps as a construct seeks to address this gap, it doesn't go far enough. It does not explain how to bridge the gap or even why it's important to do so.
In this talk we’ll use Netflix Operations Engineering as a case study to address these questions. We'll explore common challenges faced by operational teams and strategies to overcome them.
How to Become a Thought Leader in Your Niche (Leslie Samuel)
Are bloggers thought leaders? Here are some tips on how you can become one. Provide great value, put awesome content out there on a regular basis, and help others.
Maintaining the Front Door to Netflix: The Netflix API (Daniel Jacobson)
This presentation was given to the engineering organization at Zendesk. In this presentation, I talk about the challenges that the Netflix API faces in supporting the 1000+ different device types, millions of users, and billions of transactions. Topics include resiliency, scale, API design, failure injection, continuous delivery, and more.
Presentation from Web2.0 Expo NY 2011.
Good architecture of systems and flexibility of content has also allowed NPR the freedom and agility to quickly deploy solid user experience and elegant design to multiple platforms. This presentation will cover how NPR improved the code inside its API to be more efficient, while meeting new and evolving product needs.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2jtXmQR.
Konrad Malawski explains what the word Stream means, then looks at how its protocol works and how one might use it in the real world showing examples using existing implementations. He also peeks into the future, to see what the next steps for such collaborative protocols and the JDK ecosystem are in general. Filmed at qconsf.com.
Konrad Malawski works on Akka at Lightbend and is a Reactive Streams committer; he participated in the Reactive Streams initiative and implemented its Technology Compatibility Kit. He maintains a number of Scala tools (such as sbt-jmh), and frequently speaks on distributed systems and concurrency topics at conferences. He also holds a number of titles, including JavaOne RockStar 2015.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1uRYaAR.
Volker Pacher and Sam Phillips present key differences between relational databases and graph databases, and how they use the latter to model a complex domain and to gain insights into their data. Filmed at qconlondon.com.
Sam Phillips is Head of Engineering for eBay's Local Delivery team, bringing super fast delivery to customers in the UK and US. Volker Pacher is a Senior Developer at eBay Local Delivery. Before its acquisition by eBay, he was a member of the core team at Shutl helping to transition from a monolithic application to SOA and introducing new technologies, among them Neo4j.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1bJsakO.
Ben Christensen describes how the Netflix API evolved from a typical one-size-fits-all RESTful API designed to support public developers into a web service platform optimized to handle the diversity and variability of each device and user experience. The talk will also address the challenges involving operations, deployment, performance, fault-tolerance, and rate of innovation at massive scale. Filmed at qconsf.com.
Ben Christensen is a software engineer on the Netflix Edge Services Platform team responsible for fault tolerance, performance, architecture and scale while enabling millions of customers to access the Netflix experience across more than 1,000 different device types. Prior to Netflix, Ben was at Apple making iOS apps and media available on the iTunes store.
As the media leader who first brought a public content API to the market in 2008, NPR continues to innovate and learn about what it means to have flexible content. Our philosophy assumes that to maintain relevancy in an online world, media companies need to be adroit at delivering content to multiple channels and disparate platforms. This in turn has led us to keep a strategic focus on our API development. This positions us not just to meet our distribution needs, but has also helped drive business opportunity and allows for effective design and user experience whether in a browser or on a mobile device. This presentation will share our lessons learned and key metrics around successful creation and use of flexible content, from technology needs to business, editorial and design opportunities in an increasingly fragmented online product landscape.
Ajith Ranabahu, Priti Parikh, Maryam Panahiazar, Amit Sheth and Flora Logan-Klumpler: Kino : Making Semantic Annotations Easier, Presented at 5th Intl Conf on Semantic Computing (ICSC2011), Palo Alto, CA, September 2011.
Extreme Testing with Selenium - @hugs at Jenkins User Conference 2011 (hugs)
Slides from Jason Huggins' talk at Jenkins User Conference, October 2011 in San Francisco. Title: "Extreme Testing with Selenium - In the Cloud and In the Garage."
Why do containers suddenly matter so much when they have been around since 1998? Take a look at the potential of OpenStack's Magnum, Murano and Nova-Docker in the context of leveraging the incredible interest in Linux Containers brought about by Docker.
Check out www.stackengine.com to learn more about our excellent container management solution.
The comprehension of very large-scale software system evolution remains a challenging problem due to the sheer amount of time-based (i.e., a sequence of changes) data and its intrinsically complex nature (i.e., heterogeneous changes across the entire system source code). It is a necessary step for program comprehension, as systems are not simply created out of thin air in a bang, but are the sum of many changes over long periods of time, by various actors and due to various circumstances.
We present SYN, a web-based tool that uses versatile visualization and data processing techniques to create scalable depictions of ultra-scale software system evolution. SYN has been successfully applied on several systems versioned on GitHub, including the nearly 20-year history of the Linux operating system, which totals more than one million commits on more than 100k evolving files.
Immutable Infrastructure: Rise of the Machine Images (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1WlpXHF.
Axel Fontaine looks at what Immutable Infrastructure is and how it affects scaling, logging, sessions, configuration, service discovery and more. He also looks at how containers and machine images compare and why some things people took for granted may not be necessary anymore. Filmed at qconlondon.com.
Axel Fontaine is the founder and CEO of Boxfuse. Axel is also the creator and project lead of Flyway, the open source tool that makes database migration easy. He is a Continuous Delivery and Immutable Infrastructure expert, a Java Champion, a JavaOne Rockstar and a regular speaker at various large international conferences.
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools (Amazon Web Services)
Today’s cutting-edge companies have software release cycles measured in days instead of months. This agility is enabled by the DevOps practice of continuous delivery, which automates building, testing, and deploying all code changes. This automation helps you catch bugs sooner and accelerates developer productivity. In this session, we’ll share the processes that Amazon’s engineers use to practice DevOps and discuss how you can bring these processes to your company by using a new set of AWS tools (AWS CodeCommit, AWS CodePipeline, and AWS CodeDeploy). These services were inspired by Amazon's own internal developer tools and DevOps culture.
ISTA 2019 - Migrating data-intensive microservices from Python to Go (Nikolay Stoitsev)
In order for our systems to scale continuously and be resilient, they need to be constantly evolving. In this talk, I'm going to tell the story of how my team migrated a data-intensive microservice from Python to Go. First, we are going to start with the rationale behind the migration. Then we are going to go over the Python and Go tech stacks that we use. Last but not least, I'm also going to share our approach for migrating the service while running in production, adding new features and making sure there are no regressions.
Netflix Edge Engineering Open House Presentations - June 9, 2016Daniel Jacobson
Netflix's Edge Engineering team is responsible for handling all device traffic to support the user experience, including sign-up, discovery and the triggering of the playback experience. Developing and maintaining this set of massive-scale services is no small task, and its success is the difference between millions of happy streamers and millions of missed opportunities.
This video captures the presentations delivered at the first ever Edge Engineering Open House at Netflix. This video covers the primary aspects of our charter, including the evolution of our API and Playback services as well as building a robust developer experience for the internal consumers of our APIs.
NPR's Digital Distribution and Mobile StrategyDaniel Jacobson
The NPR API has been the great enabler to achieve rapid development in the mobile space. That is, because we have our rich and powerful API, our mobile team is free to pursue the development of their mobile products without being encumbered by limited internal development resources. The touch-point between the mobile product and our content is fixed which means the mobile team can focus on design and usability for the specific platform.
These slides demonstrate some of the usage and metrics of the NPR API. In addition to the flow of an NPR story from creation to distribution, I also tried to provide a reasonable sampling of the more popular or interesting implementations.
These slides are from the OpenID UX Summit at Sears in Chicago. We discuss the newly formed Adoption Committee for OpenID, NPR's identity sharing strategy, Sears' OpenID case study, PBS' case study, and the goal towards a federated public media identity.
This presentation shows the same NPR story displayed in a wide range of platforms. The content, through the principles of COPE, is pushed out to all of these destinations through the NPR API. Each destination, meanwhile, uses the appropriate content for that presentation layer.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes work: it takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
13. Scientific Practice
Kuhn’s View
[Diagram: experiments on the current assumption surface anomalies over time; the anomalies give rise to a new assumption — a scientific revolution (aka paradigm shift).]
30. Emerging Focus for the API Industry
• Internal API Consumers
• API Consumer Simplicity Over API Provider Simplicity
• Scaling
• Resiliency
• Tools and Insights
• Testing and Automation
52. Emerging Focus for the API Industry
• Internal API Consumers
• API Consumer Simplicity Over API Provider Simplicity
• Scaling
• Resiliency
• Tools and Insights
• Testing and Automation
89. Benefit to Thinking Versionless
• If you can achieve it, maintenance will be MUCH simpler
• If you cannot, it instills better practices
• Reduces lazy programming
• Results in fewer versions
• Results in a cleaner, less brittle system
REMEMBER: Adding new features typically does not require a new version (structural changes and removals do)
90. Recipe for Targeted APIs
API providers that have a:
1. small number of targeted API consumers
2. very close relationships with those API consumers
3. increasing divergence of needs across these API consumers
4. strong desire for optimization by the API consumers
5. high value proposition for optimized APIs
91. Recipe for Targeted APIs
API providers that have a:
1. small number of targeted API consumers
2. very close relationships with those API consumers
3. increasing divergence of needs across these API consumers
4. strong desire for optimization by the API consumers
5. high value proposition for optimized APIs
6. a generous helping of chocolate (to keep engineers happy)
123. Single Canary Instance
To test new code with production traffic (around 1% or less of traffic)
[Diagram: API requests from the Internet routed to the current code in production, plus a single canary instance running the new code.]
126. Single Canary Instance
To test new code with production traffic (around 1% or less of traffic)
[Diagram: same routing as before, now with the canary instance surfacing an error.]
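The single-canary idea on these slides can be sketched as a weighted router: send roughly 1% of production traffic to the canary and the rest to the existing fleet. The instance names, the exact weight, and the router function below are illustrative assumptions, not Netflix's actual routing logic.

```python
import random

# Hypothetical canary router. Fleet names and the 1% weight are
# assumptions for illustration only.
PROD_FLEET = ["prod-1", "prod-2", "prod-3"]
CANARY = "canary-1"
CANARY_WEIGHT = 0.01  # around 1% or less of traffic

def route_request(rng=random.random):
    """Send roughly CANARY_WEIGHT of traffic to the canary instance."""
    if rng() < CANARY_WEIGHT:
        return CANARY
    return random.choice(PROD_FLEET)

# Simulate 100,000 incoming requests and count where they land.
counts = {"canary": 0, "prod": 0}
for _ in range(100_000):
    target = route_request()
    counts["canary" if target == CANARY else "prod"] += 1
```

With enough traffic, even a 1% slice gives the new code a statistically meaningful sample of real production requests, while an error in the canary (as in slide 126) affects only that slice.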
138. Netflix API Focus
• Internal API Consumers
• API Consumer Simplicity Over API Provider Simplicity
• Scaling
• Resiliency
• Tools and Insights
• Testing and Automation
142. The Structure of API Revolutions
Image courtesy of SakeThrajan
@daniel_jacobson
djacobson@netflix.com
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Editor's Notes
Thomas Kuhn published The Structure of Scientific Revolutions in 1962. The book was pretty controversial at the time, and in fact, still offers some pretty contentious views.
Kuhn describes the predominant view of scientific practice as an effort to continuously “discover” reality. Scientific discoveries are therefore building on top of past discoveries over time, continually getting closer to a comprehensive view of reality.
Eventually, in principle, science will discover the full truth about reality.
Kuhn’s view is that science does not “discover” reality or build on top of past “discoveries”. Rather, he believes that the majority of scientific work (which he labels “normal science”) is focused on puzzle solving on top of an initial assumption.
For example, the common belief centuries ago was that the earth was the center of the universe, with all of the planets and the sun revolving around it (the geocentric view).
Given that assumption, normal science builds hypotheses and performs experiments against it. The hypotheses can be proven or disproven within that context and they can build on each other, but they are not progressing towards an unveiling of reality.
During the course of normal science, however, anomalies are encountered. These anomalies are often cast aside as errors in observation/tests or for some other reason. But over time, they mount up, becoming too numerous or too significant to be ignored.
Regarding the geocentric view, the phases of Venus became a very powerful anomaly. This anomaly essentially demonstrates that the shadows and reflections on Venus from the Sun, as well as how it moves throughout the sky, make it impossible for it to revolve around the Earth.
When the anomalies encountered are large or frequent enough, they give rise to a competing point of view, a competing assumption. Hypotheses and experiments begin on the new assumption, typically by scientists who otherwise do not practice normal science on the initial assumption.
The phases of Venus anomaly ultimately surfaced the competing assumption that the Sun is at the center and the Earth is one of many other objects revolving around it (the heliocentric view).
If the competing assumption gains enough traction through revolutionary science and becomes strong enough, there is a scientific revolution where the original assumption is completely overthrown in favor of the new one. Kuhn coined the term “Paradigm Shift” to represent this. It is also important to note that paradigm shifts can take a long time to develop and to conclude. But these shifts are absolute, meaning only one of these paradigms can be the focus of normal science.
To be clear, this revolution is not one where the new paradigm is necessarily better than the old one as neither are truly representing reality. It is just a new assumption that appears to be filling the holes of the original or is more representative of modern thinking. In fact, over time, the new paradigm will likely suffer its own anomalies and could very well fall prey to another competing paradigm.
The pattern described in his view of scientific revolutions is often seen in technological development. Specifically, there are many such cases in the API industry. The following are a few of these examples.
The migration to JSON is mostly due to improved language support and the slenderness of JSON’s payload. As a result, the debate seems to be shifting pretty quickly towards JSON.
Here is a breakdown of formats used by a range of APIs in APIHub. The distribution here clearly shows a shift towards JSON over XML.
Similarly, from this (out of date, but still relevant, especially since the trend is continuing in the same direction) bar graph, more and more public APIs are launching with no XML support at all, only supporting JSON. The trends are clearly showing that the revolution is underway and that JSON is becoming the new paradigm.
Nonetheless, it is important to note that this debate is over-simplified. When you support more and more devices and/or partnerships, it quickly becomes clear that API providers need to be nimble enough to handle a wide range of formats. For example, some devices perform better with hierarchical JSON and others with flat object models, others may require specific XML or JSON schemas, some devices prefer full document delivery and others prefer streaming bits, etc. Finally, this should be a non-issue from an architectural perspective for the API providers… Adding new outputs should be easy and if it is not, you have a problem with your system’s architecture.
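The claim that adding new outputs should be architecturally easy can be sketched as a renderer registry: one internal representation, many wire formats, and a new format is one new entry rather than a redesign. The registry, function names, and sample title below are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative sketch: one internal representation, many wire formats.
def to_json(title):
    return json.dumps(title)

def to_xml(title):
    root = ET.Element("title")
    for key, value in title.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Adding a format means adding one entry here, not rearchitecting.
RENDERERS = {"json": to_json, "xml": to_xml}

def render(title, fmt):
    return RENDERERS[fmt](title)

doc = {"name": "Example Title", "year": 2013}
```

A device that prefers JSON calls `render(doc, "json")`; an XML-only partner gets `render(doc, "xml")` from the same internal object.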
Another common debate is REST vs. SOAP. Clearly, this comparison demonstrates the ease of REST over SOAP.
And this chart from ProgrammableWeb shows that the ease of use translates into conversions from SOAP to REST. This revolution is in full swing, in fact, to the point where REST is fully entrenched as the current assumption. It is also important to note that going from SOAP to REST is a true revolution (as defined by Kuhn) as it results in throwing out one for the other.
My favorite debate is Open vs. Closed APIs (aka. Public vs. Private).
According to ProgrammableWeb (again), the growth of APIs has been tremendous. The APIs tracked, however, are largely open/public APIs, but that is a small subset of the number of APIs that companies use everyday.
One way to represent this is with an iceberg. At the top, above the water line, is the smallest part of the iceberg. This represents the open APIs, as both are highly visible. But the majority of the iceberg’s mass is under the water line, where it is not visible. That larger underwater mass equates to the closed APIs: the ones you are most likely not aware of account for the vast majority of the APIs that exist, along with the vast majority of the traffic and consumption.
In fact, many companies who have opened up their content to the world have seen tremendous traffic from internal services relative to their public feeds or APIs. These five companies all have public APIs, but the overwhelming traffic comes from their branded applications built internally or through direct partnerships. This is another example of a revolution where companies are discounting their focus on public APIs in favor of their private APIs. Revolutions such as these, however, take time…
Ultimately, the audience for the API should be the prime driver for its design. The debates presume an objective right or wrong way, which can only be determined against the audience.
Another way to think about this is that the public API is a product in and of itself. The private API is a means to an end. With the trends towards private APIs, the focus of those APIs should then move towards supporting the business goals.
Again, as the focus becomes more and more towards internal consumption, these are some of the key focal points for API providers moving forward.
To demonstrate these focal points, I will use Netflix as an example.
But first, the question that needs to be considered is whether or not Netflix is an early anomaly that will lead to a revolution…
Or are we just a black sheep?
All of this started with the launch of streaming in 2007. At the time, we were only streaming on computer-based players (i.e., no devices, mobile phones, etc.). Also at this time, the content was not fully liberated.
Shortly after streaming launched, in 2008, we launched our REST API. I describe it as a One-Size-Fits-All (OSFA) type of implementation because the API itself sets the rules and requires anyone who interfaces with it to adhere to those rules. Everyone is treated the same.
The OSFA API launched to support the 1,000 flowers model. That is, we would plant the seeds in the ground (by providing access to our content) and see what flowers sprout up in the myriad fields throughout the US. The 1,000 flowers are public API developers. At the launch of the public API, the content was fully liberated and the bird was set free to fly around in the open world.
And at launch, the API was exclusively targeted towards and consumed by the 1,000 flowers (i.e., external developers). So all of the API traffic was coming from them.
Some examples of the flowers…
But as streaming gained more steam…
The API evolved to support more of the devices that were getting built. The 1,000 flowers were still supported as well, but as the devices ramped up, they became a bigger focus. And the bird was now mostly flying around the house with occasional visits to the open world.
With the adoption of the devices, API traffic took off! We went from about 600 million requests per month to about 42 BILLION requests in just two years. And it has grown substantially more since then.
Meanwhile, the balance of requests by audience had completely flipped. Overwhelmingly, the majority of traffic was coming from Netflix-ready devices and a shrinking percentage was from the 1,000 flowers. The rough distribution is now more than 1000-to-1 in favor of internal consumption.
We now have more than 36 million global subscribers in more than 50 countries and territories.
Those subscribers consume more than a billion hours of streaming video a month, which accounts for 33% of the peak Internet traffic in North America.
All 36 million of Netflix’s subscribers are watching shows and movies on virtually any device that has a streaming video screen. We are now on more than 1000 different device types, most of which are supported by the Netflix API, to be discussed throughout this presentation.
As our product has evolved over the years, so has the engineering organization. The organization that develops the product is basically shaped like an hourglass.
In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. There are currently more than 1000 different device types that we support. To put that into perspective, we support a few hundred more device types than there are engineers at Netflix.
At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
The API is at the center of the hourglass, acting as a broker of data.
Given that position in the stack, it is easy to see that the weight of our product, for better or worse, relies on a solid foundation in the API layer.
So, the API needs to adopt a new set of criteria for supporting the business. We no longer need to focus on attracting external developers, building communities, etc. (all exercises that point more towards the top end of the iceberg). Rather, we need to focus our efforts on being the means to the end, or the bottom end of the iceberg. These are the things needed to be great at being the means.
Our new design has been predicated on internal API consumers with a focus towards simplifying the way in which consumers interact with the API, rather than how simple it is for us to administer it.
Again, our API traffic grew tremendously over the last few years.
Today, we are doing more than 2B incoming requests per day. That kind of growth and those kinds of numbers seem great. Who wouldn’t want those numbers, right?
Especially if you are an organization like ESPN serving web pages that have ads on them. If espn.com was serving 2B requests a day, each one of those requests would create impressions for the ad which translates into revenue (and potentially increased CPM at those levels).
But the API traffic is not serving pages with ads. Rather, we are delivering documents like this, in the form of XML…
Or like this, in the form of JSON.
Growth in traffic, especially if it were to continue at this rate, does not directly translate into revenue. Instead, it is more likely to translate into costs. Supporting massive traffic requires major infrastructure to support the load, expenses in delivering the bits, engineering costs to build and support more complex systems, etc.
So our first realization was that we could potentially significantly reduce the chattiness between the devices and the API while maintaining the same or better user experience. Rather than handling 2 billion requests per day, could we have the same UI at 300 million instead? Or less? Could having more optimized delivery of the metadata improve the performance and experience for our customers as well?
For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis?
Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
Many UI teams needing metadata means many requests to the API team. In the OSFA world, we essentially needed to funnel these requests and then prioritize them. That means that some teams would need to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints, resulting in a more complex system as well as a lot of spaghetti code. Making teams wait due to prioritization was exacerbated by the fact that tasks took longer as the technical debt increased, causing time to build and test to grow. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
All of these aforementioned issues are essentially anomalies in the current OSFA paradigm. For us, these anomalies carve a path for a revolution (meaning, an opportunity for us to overthrow our current OSFA paradigm with a solution that makes up for the OSFA deficiencies).
We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
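A minimal sketch of the contrast, with invented data and function names: the resource-based path makes one network call per resource (pointillism), while an experience-based endpoint composes the same screen server-side in a single call (the photograph).

```python
# Invented catalog data for illustration; not Netflix's actual data model.
CATALOG = {
    "queue": ["Title A", "Title B"],
    "ratings": {"Title A": 4.5, "Title B": 3.0},
}

# Resource-based: each granular resource is its own (simulated) network call.
def get_queue():
    return CATALOG["queue"]

def get_rating(title):
    return CATALOG["ratings"][title]

def build_home_screen_resource_style():
    """The device assembles the screen itself, one call per resource."""
    calls = 0
    queue = get_queue(); calls += 1
    rows = []
    for title in queue:
        rows.append({"title": title, "rating": get_rating(title)})
        calls += 1
    return rows, calls

# Experience-based: one device-optimized endpoint composes the screen
# server-side, so the device makes a single request.
def ps3_home_screen():
    rows = [{"title": t, "rating": CATALOG["ratings"][t]}
            for t in CATALOG["queue"]]
    return rows, 1
```

Both paths produce the same screen; the difference is where composition happens and how many round trips cross the network.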
Another key point in our new implementation is that we are not versioning this API in the way that most APIs consider versioning. Rather, we have more of a deprecation model.
The problem with versioning, particularly when supporting as many devices as Netflix does, is that many of these devices cannot be updated. And in the case of TVs, for example, they sit on people’s walls for 7-10 years with limited (if even possible) options for updating the app. As a result, any published API version that a TV app calls needs to be supported for that long duration.
Ultimately, you may end up supporting a large, and growing, number of API versions. The more you support, the tougher it is to maintain and the greater the burden it places on your innovation. Right now, Netflix has a 1.0, 1.5 and 2.0 API. You can quickly imagine in the next 10 years the possibility of a 3.0, 4.0, 5.0, 6.0, etc., making the codebase daunting, ugly and brittle.
Rather, we wanted/needed to get away from the slippery slope of versioning so we can have a more sustainable model moving forward. So, when a Java API change is needed, rather than spinning up a new version, we add the new methods then work closely with the UI teams so they can adopt the new APIs. Quick adoption enables us to deprecate the old ones quickly.
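The deprecation model described here might look roughly like the following sketch. The method names and the use of Python's warning machinery are assumptions for illustration; the actual Netflix implementation is a Java API with close coordination between teams.

```python
import warnings

# Hypothetical sketch of a deprecation model instead of versioning.
def get_title_detail(title_id):
    """New method added alongside the old one; UI teams migrate to this."""
    return {"id": title_id, "artwork": "boxart.jpg"}

def get_title(title_id):
    """Old method kept temporarily while consumers migrate.

    Rather than spinning up a /v2, the old method stays callable but
    nudges its (internal, known) callers toward the replacement. Once
    adoption is complete, it can be deleted outright.
    """
    warnings.warn("get_title is deprecated; use get_title_detail",
                  DeprecationWarning)
    return {"id": title_id}
```

Because the consumers are internal and known, adoption of the new method can be tracked directly, and the deprecated one removed quickly instead of living forever as "v1".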
In terms of revolutions, Netflix may just be a lone anomaly that will be cast away as just that. Given my many conversations with other API providers, however, I suspect that the anomalies encountered with the OSFA APIs are becoming more pervasive. This will likely result in a broader revolution at some point in the future (who knows when…) That said, this design is not for everyone, even if you are experiencing the anomalies that I have discussed. Here is a recipe for those to which something like this could apply…
And don’t forget a generous helping of chocolate for your engineers!
Because the business relies on a stable API, we need to have robust systems for resiliency, scaling and insights. This, along with the practices from a range of other teams, helps us better protect Netflix customers from failures.
At Netflix, we have a range of engineering teams who focus on specific problem sets. Some teams focus on creating rich presentation layers on various devices. Others focus on metadata and algorithms. For the streaming application to work, the metadata from the services needs to make it to the devices. That is where the API comes in. The API essentially acts as a broker, moving the metadata from inside the Netflix system to the devices in customers’ homes.
Given the position of the API within the overall system, the API depends on a large number of underlying systems (only some of which are represented here). Moreover, a large number of devices depend on the API (only some of which are represented here). Sometimes, one of these underlying systems experiences an outage.
In the past, such an outage could result in an outage in the API.
And if that outage cascades to the API, it is likely to have some kind of substantive impact on the devices. The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems.
To protect our customers from this problem, we created Hystrix (which is now available on our Open Source site). Hystrix is a fault tolerance and resiliency wrapper that isolates dependencies and allows us to treat them differently as problems arise.
This is a view of the dashboard that shows some of our dependencies. This dashboard, as well as Turbine, is available in our open source site as well. This dashboard is used as a visualization of the health and traffic of each dependency.
This is a view of asingle circuit.
This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
The blue line represents the traffic trends over the last two minutes for this dependency.
The green number shows the number of successful calls to this dependency over the last two minutes.
The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
The orange number shows the number of calls that have timed out, resulting in fallback responses.
The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
The red number shows the number of exceptions, resulting in fallback responses.
The error rate is calculated as the total number of error and fallback responses divided by the total number of calls handled.
If the error rate exceeds a certain threshold, the circuit is automatically opened and calls are short-circuited to the fallback. When the error rate returns below that threshold, the circuit is closed again.
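The trip/reset behavior can be sketched as follows. This is a simplified illustration: the threshold and counters here are assumptions, and Hystrix's real implementation uses rolling statistical windows and a half-open probing state rather than the naive cumulative rate shown.

```java
// Simplified sketch of error-rate-driven circuit logic: record each call,
// open the circuit when the error rate exceeds a threshold, and close it
// again when the rate drops back below.
class Circuit {
    private int calls;
    private int failures;
    private final double threshold;   // e.g. 0.5 = open at 50% errors
    private boolean open;

    Circuit(double threshold) { this.threshold = threshold; }

    void record(boolean success) {
        calls++;
        if (!success) failures++;
        double errorRate = (double) failures / calls;
        open = errorRate > threshold;   // open above threshold, close below
    }

    boolean isOpen() { return open; }
}
```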
The dashboard also shows host and cluster information for the dependency…
As well as information about our SLAs.
So, going back to the engineering diagram…
If that same service fails today…
We simply disconnect from that service.
And replace it with an appropriate fallback. In some cases, the fallback is a degraded set of data, in other cases it could be a fast fail 5xx response code. In all cases, our goal is to ensure that an issue in one service does not result in queued up requests that can create further latencies and ultimately bring down the entire system.
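The two fallback styles mentioned above could look something like this. The names and data are entirely hypothetical, not Netflix code; the point is only the contrast between serving degraded data and failing fast.

```java
import java.util.List;

// Illustrative only: a fallback either returns degraded canned data, or
// fast-fails with a 5xx so requests don't queue up behind a broken dependency.
class Fallback {
    static List<String> degraded() {
        // A canned "popular titles" list is better than no response at all.
        return List.of("Popular Title A", "Popular Title B");
    }

    static int fastFail() {
        // When no sensible degraded data exists, answer immediately with a 5xx.
        return 503;
    }
}
```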
Ultimately, this allows us to keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
As a general practice, Netflix focuses on getting code into production as quickly as possible to expose features to new audiences.
We do spend a lot of effort testing before deploying our code. That said, we do not attempt the futile feat of trying to make our testing bullet-proof. Instead, we have adopted some new techniques to help us learn more about what the new code will look like in production. After all, there is no substitute for the variability and load that the production servers can offer (especially when handling 50k+ requests per second).
First and foremost, we are able to perform these techniques because of the flexibility that the AWS cloud offers us.
We have then built our own tools that enable us to see how healthy a range of environments are, among other things…
Assuming environments are healthy, we then pursue two primary techniques that help us get code into production: canary deployments and red/black deployments.
The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase.
We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
We have canary analysis tools to help us understand how healthy the canary codebase is relative to the current codebase. There are a range of dimensions that go into this calculation, but the final score is provided in the dial above. If green, it is ready to go. If yellow, needs more investigation. If red, definitely not ready.
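A hypothetical sketch of such a score: compare each metric on the canary against the baseline fleet, average the per-metric ratios into one score, and map it to the green/yellow/red dial. Netflix's real analysis is far richer than this; the metric choice, capping, and cutoffs below are all assumptions for illustration.

```java
// Hypothetical canary scoring sketch: baseline[i] and canary[i] hold the
// same metric (e.g. error rate, latency) where lower is better.
class CanaryAnalysis {
    static String classify(double[] baseline, double[] canary) {
        double score = 0;
        for (int i = 0; i < baseline.length; i++) {
            // 1.0 means "as good as baseline"; cap so one metric can't dominate
            score += Math.min(baseline[i] / canary[i], 2.0);
        }
        score /= baseline.length;
        if (score >= 0.95) return "green";   // ready to go
        if (score >= 0.75) return "yellow";  // needs more investigation
        return "red";                        // definitely not ready
    }
}
```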
We also use dashboards and charts that show more granular data about the health of the canary.
If the canary encounters problems, it will register in any number of ways.
If the canary shows errors, we pull the canary server(s) down, re-evaluate the new code, debug it, etc.
We will then repeat the process until the analysis of the canary servers looks good.
If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
Then switch the pointer to have external requests point to the black servers.
Sometimes errors are encountered at this stage as well…
If a problem is encountered from the black servers, it is easy to rollback quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
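The pointer switch and rollback described above can be sketched like this (the cluster names and router class are illustrative; in practice this is a load balancer or DNS change, not an in-process reference):

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative red/black routing sketch: traffic follows a single pointer,
// so rollback is just pointing back at the red cluster, not a redeploy.
class RedBlackRouter {
    private final AtomicReference<String> active = new AtomicReference<>("red");

    String activeCluster() { return active.get(); }

    // Cut traffic over to the new (black) cluster.
    void switchTo(String cluster) { active.set(cluster); }

    // On trouble in black, rollback is one pointer flip.
    void rollback() { active.set("red"); }
}
```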
Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
If the new code looks good in the canary, we can then bring up another set of servers with the new code.
Then we switch the pointer so that external requests draw from the black servers, moving production traffic to the new code. If everything still looks good, we disable the red servers and the new code becomes the new red.
Throughout these steps, we have tools such as this one that show the status of the various deployments around the world.
Again, these are the areas of focus for the Netflix API. Based on the pending revolution around private APIs, I suspect that we will see more companies focus on these things as well.
But keep in mind, these revolutions happen often in technology. We are constantly on a quest to plug the leaks in our previous systems by replacing them with new, improved ones. The hope is that the paradigm shift results in fewer or smaller leaks. But make no mistake, there will be leaks and anomalies in the new system!
So don’t get too comfortable with any system that you support. Don’t get married to any technology, guideline, protocol, etc. They are all just means to an end.