This presentation was given to the engineering organization at Zendesk. In it, I talk about the challenges the Netflix API faces in supporting more than 1,000 different device types, millions of users, and billions of transactions. Topics include resiliency, scale, API design, failure injection, continuous delivery, and more.
Application Modernization: Migrating Mainframe Apps to the Cloud Using Spring, by VMware Tanzu
SpringOne 2021:
Session Title: Application Modernization: Migrating Mainframe Apps to the Cloud Using Spring
Speaker: Glenn Renfro, Spring Developer at VMware
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi..., by Databricks
Salesforce recently invented and deployed a real-time, scalable, low-false-positive personalized anomaly detection system operating at terabyte scale. Anomaly detection on user in-app behavior at that scale is extremely challenging because traditional techniques such as clustering suffer serious performance issues in production.
Salesforce’s method tackles the traditional challenges through three phases: 1) Leveraging Principal Component Analysis (PCA) to extract high-variance and low-variance feature subsets. The low-variance feature subset is valuable in cybersecurity because the goal is to determine whether a user deviates from his or her stable behavior; the high-variance one is used for dimension reduction. 2) On each feature subset, they build a profile for each user to characterize the user’s baseline behavior and legitimate abnormal behavior. 3) During detection, each incoming event is compared with the user’s profile to produce an anomaly score. The computational complexity of the detection step for each incoming event is constant.
[…]cloud computing platforms; the novelty of our user-behavior-profiling-based anomaly detection technique; and the challenges of implementing and deploying it with Apache Spark in production. We will also demonstrate how our system outperforms other traditional machine learning algorithms.
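The three-phase profiling approach described above can be sketched in plain NumPy on synthetic data. All names, the variance threshold, and the max-distance scoring rule are illustrative assumptions, not Salesforce's actual implementation:

```python
import numpy as np

def fit_profile(X, var_threshold=0.9):
    """Phases 1 and 2: split features into high-/low-variance PCA subspaces
    and record a per-user baseline in each. X is (n_events, n_features)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    explained = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(explained, var_threshold)) + 1
    high, low = Vt[:k].T, Vt[k:].T        # high-/low-variance bases
    baseline = {"high": np.linalg.norm(Xc @ high, axis=1).max(),
                "low": np.linalg.norm(Xc @ low, axis=1).max()}
    return mu, high, low, baseline

def anomaly_score(event, mu, high, low, baseline, eps=1e-9):
    """Phase 3: constant-time score for one incoming event -- its distance
    from the profile in each subspace, normalized by the baseline."""
    d = event - mu
    s_high = np.linalg.norm(d @ high) / (baseline["high"] + eps)
    s_low = np.linalg.norm(d @ low) / (baseline["low"] + eps)
    return max(s_high, s_low)             # > 1.0: outside observed behavior

# Synthetic "normal" behavior for one user: 8 features with varying scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)) * np.array([5, 4, 3, 2, 1, 0.5, 0.2, 0.1])
mu, high, low, base = fit_profile(X)
normal = anomaly_score(X[0], mu, high, low, base)       # within baseline
odd = anomaly_score(X[0] + 50, mu, high, low, base)     # far outside it
```

Because the bases and baselines are precomputed per user, scoring one event is just two small matrix-vector products, which is what makes per-event detection cost constant.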
Micro Focus Software Delivery and Testing: Jan De Coster's presentation on the journey to DevOps at the recent Micro Focus #DevDay Copenhagen.
Micro Focus enables enterprise software organizations to build innovative software and accelerate application delivery to meet the needs of the business. Whatever the challenges and infrastructures, our core principle—of reusing what already works to minimize business risk while supporting modern software practices—has positioned our customers to be better prepared to support the digital transformation of the business.
Build, test and deliver innovative software faster with less risk.
April 2017.
Explore the components of a simple web app, see how Applications Manager helps you discover and map relationships between your apps, and take a look at the wide range of business-critical metrics monitored for every app/server in your network. We will also explore how to logically group your apps so as to achieve maximum productivity from your IT department.
Enabling self-service automation with ServiceNow and Ansible Automation Platform, by Michael Ford
Red Hat Ansible Automation Platform allows end users in the enterprise to deploy on-demand workloads and perform common automated tasks in a safe, scalable manner. Additionally, Python/PowerShell module support and a RESTful API allow the same users to interact with a familiar interface while launching automated Ansible tasks in the background. One of the most ubiquitous self-service platforms in use today is ServiceNow, and many conversations with Ansible Automation Platform customers focus on ServiceNow integration.
In this session, we’ll showcase the ServiceNow and Ansible Automation Platform integration that provides self-service IT scaling for everyone. You’ll see how ServiceNow can be used to kick off a complex cloud deployment with Ansible, how Ansible will manage the life cycle of ServiceNow artifacts, and how Ansible can use the ServiceNow Inventory plug-in to manage Day 2 operations in your cloud environment.
Kappa vs Lambda Architectures and Technology Comparison, by Kai Wähner
Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers.
This video explores why a single real-time pipeline, called Kappa architecture, is the better fit for many enterprise architectures. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for a Lambda architecture.
The main focus of the discussion is on Apache Kafka (and its ecosystem) as the de facto standard for event streaming to process data in motion (the key concept of Kappa), but the video also compares various technologies and vendors such as Confluent, Cloudera, IBM, Red Hat, Apache Flink, Apache Pulsar, AWS Kinesis, Amazon MSK, Azure Event Hubs, Google Pub/Sub, and more.
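The single-pipeline idea at the heart of Kappa can be sketched with a toy in-memory log standing in for a Kafka topic. The class, event shapes, and aggregation below are illustrative assumptions, not code from the talk:

```python
from collections import defaultdict

class Log:
    """Toy stand-in for a Kafka topic partition: an append-only log."""
    def __init__(self):
        self.events = []
    def append(self, event):
        self.events.append(event)
    def replay(self, from_offset=0):
        yield from self.events[from_offset:]

def build_view(log, from_offset=0):
    """One pipeline serves both roles: tail the log for real-time updates,
    or replay it from offset 0 to rebuild a view -- no separate batch layer."""
    totals = defaultdict(float)
    for event in log.replay(from_offset):
        totals[event["user"]] += event["amount"]
    return dict(totals)

log = Log()
for e in ({"user": "a", "amount": 3.0},
          {"user": "b", "amount": 1.5},
          {"user": "a", "amount": 2.0}):
    log.append(e)

view = build_view(log)            # {'a': 5.0, 'b': 1.5}
# Changed the processing logic? Deploy the new code and replay from offset 0:
# that replay is the Kappa answer to Lambda's separate batch layer.
rebuilt = build_view(log, from_offset=0)
```

In a real deployment the log would be a long-retention (or compacted) Kafka topic and the view a stream-processing job, but the design choice is the same: reprocessing is a replay of the one pipeline, not a second codebase.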
Video recording of this presentation:
https://youtu.be/j7D29eyysDw
Further reading:
https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/
https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/
Bring the VMware Software-Defined Data Center to Amazon Web Services with VMware Cloud. In this webinar we will dive into the compute, network and storage architecture of the VMware Cloud on AWS solution. We will look at real-world, live applications running in VMware Cloud on AWS which integrate with native AWS services such as S3 and Amazon Relational Database Service. We’ll discuss common deployment scenarios including Hybrid Cloud Architectures and Disaster Recovery and explore how the TCO of these implementations differs in VMware Cloud as compared to on-premises implementations.
Learn how Site24x7 gives you end-to-end application performance visibility for your Java, .NET and Ruby web transactions with metrics of all components starting from URLs to SQL queries.
MicroServices at Netflix - challenges of scale, by Sudhir Tonse
Microservices have caught on as the design pattern of choice for many companies at scale. While microservices and SOA in general have many positives compared to monolithic apps, they come with their own challenges, especially when running at scale. These slides were for a 15-minute Meetup talk hosted at Cisco.
Failure is not an Option - Designing Highly Resilient AWS Systems, by Amazon Web Services
Customers moving mission-critical applications to the cloud are seeking guidance to replicate and improve the resiliency of their Tier-1 systems, while simultaneously meeting compliance and regulatory requirements. Natural disasters, internet disruptions, hardware or software failure can lead to events requiring customers to invoke disaster recovery (DR) plans. Join us in this session to learn how to “design for failure” and remain resilient in the event of disaster by designing applications using highly resilient components and service features.
Most data visualisation solutions today still work on data sources that are stored persistently in a data store, using so-called "data at rest" paradigms. More and more data sources today provide a constant stream of data, from IoT devices to social media streams. These data streams publish with high velocity, and messages often have to be processed as quickly as possible. For the processing and analytics of this data, so-called stream processing solutions are available, but these provide minimal or no visualisation capabilities. One way is to first persist the data into a data store and then use a traditional data visualisation solution to present it.
If latency is not an issue, such a solution might be good enough. Another question is which data store is necessary to keep up with the high load on write and read. If it is not an RDBMS but a NoSQL database, then not all traditional visualisation tools may integrate with that specific data store. Another option is to use a streaming visualisation solution; these are built specifically for streaming data and often do not support batch data. A much better solution would be one tool capable of handling both batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
This slide deck explores the challenges of securing microservices, best practices to overcome them, and how WSO2 Identity Server can be used in microservice architecture.
Watch webinar recording here: https://wso2.com/library/webinars/2018/09/the-role-of-iam-in-microservices/
Performance Engineering Masterclass: Efficient Automation with the Help of SR..., by ScyllaDB
Henrik Rexed from Dynatrace walks through how to measure, validate and visualize these SLOs using Prometheus, an open observability platform, to provide concrete examples. Next, you learn how to automate your deployment using Keptn, a cloud-native event-based life-cycle orchestration framework. Discover how it can be used for multi-stage delivery, remediation scenarios, and automating production tasks.
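As a back-of-the-envelope companion to that kind of SLO validation, the error-budget arithmetic can be sketched in a few lines. The helper names and numbers below are illustrative, not from the session:

```python
def error_budget_seconds(slo, window_seconds):
    """Downtime allowed by an availability SLO over a window."""
    return (1.0 - slo) * window_seconds

def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request error budget left -- the kind of ratio a
    Prometheus recording rule would compute from request counters."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures

# A 99.9% SLO over 30 days allows about 43 minutes of downtime:
budget_30d = error_budget_seconds(0.999, 30 * 24 * 3600)   # 2592.0 seconds
# 500 failures out of 1M requests against 99.9%: half the budget spent.
remaining = budget_remaining(0.999, 1_000_000, 500)        # 0.5
```

Gating a deployment then reduces to a comparison such as `remaining > 0`, which is the sort of automated quality-gate decision a tool like Keptn evaluates between stages.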
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng..., by The Linux Foundation
Safety certification is one of the essential requirements for software used in highly regulated industries. Besides technical and compliance issues (such as ISO 26262 vs. IEC 61508), transitioning an existing project to become more easily safety-certifiable requires significant changes to development practices within an open source project.
In this session, we will lay out some challenges of making safety certification achievable in open source and the Xen Project. We will outline the process the Xen Project has followed thus far and highlight lessons learned along the way. The talk will focus primarily on the necessary process and tooling changes and on community challenges that can prevent progress. We will offer an in-depth review of how the Xen Project is approaching this challenging goal and try to derive lessons for other projects and contributors.
Building an Open Edge Software Environment Using AWS Greengrass V2 and New IoT Services, by Sehyeon Lee, AWS IoT Specialist..., Amazon Web Services Korea
AWS IoT Greengrass V2, a new version of AWS IoT Greengrass, has been released to extend AWS cloud services and capabilities to the edge.
This session introduces the improved architecture and features of the newly released AWS IoT Greengrass V2, and shows how to develop edge software with agility and flexibility by leveraging the advantages of an open edge platform. It also covers the various new IoT services announced at AWS re:Invent 2020.
Aruna Ravichandran, VP of APM at CA Technologies, walks you through what Application Performance Management really is and discusses some of the major benefits of using the software.
Learn more about APM solutions from CA Technologies at http://www.ca.com/apm
Performance Engineering Masterclass: Introduction to Modern Performance, by ScyllaDB
Leandro Melendez from Grafana k6 starts off by providing a grounding in current expectations of what performance engineering and load testing entail. This session defines the modern challenges developers face, including continuous performance principles, Service Level Objectives (SLOs), and Service Level Indicators (SLIs). It delineates best practices and provides hands-on examples using Grafana k6, an open source modern load testing tool.
The Top 5 Apache Kafka Use Cases and Architectures in 2022, by Kai Wähner
I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:
Kappa Architecture: Kappa goes mainstream to replace Lambda and Batch pipelines (that does not mean that there is no batch processing anymore). Examples: Kafka-powered Kappa architectures from Uber, Disney, Shopify, and Twitter.
Hyper-personalized Omnichannel: Retail and customer communication across online and offline channels becomes the new black, including context-specific upselling, recommendations, and location-based services. Examples: Omnichannel Retail and Customer 360 in Real-Time with Apache Kafka.
Multi-Cloud Deployments: Business units and IT infrastructures span across regions, continents, and cloud providers. Linking clusters for bi-directional replication of data in real-time becomes crucial for many business models. Examples: Global Kafka deployments.
Edge Analytics: Low latency requirements, cost efficiency, or security requirements enforce the deployment of (some) event streaming use cases at the far edge (i.e., outside a data center), for instance, for predictive maintenance and quality assurance on the shop floor level in smart factories. Examples: Edge analytics with Kafka.
Real-time Cybersecurity: Situational awareness and threat intelligence need to process massive data in real-time to defend against cyberattacks successfully. The many successful ransomware attacks across the globe in 2021 were a warning for most CIOs. Examples: Cybersecurity for situational awareness and threat intelligence in real-time.
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium..., by Michael Allen
New cloud stacks, containers, microservices, automation, and DevOps are driving an explosion of application code and infrastructure complexity. It is now nearly impossible to solve the Digital Application Performance Management challenges with traditional tools and approaches. Hear how we are delivering on our vision for digital performance management, and how the role of digital virtual assistants might extend into your enterprise. Meet D.A.V.I.S.
Cloud Native Engineering with SRE and GitOps, by Weaveworks
Site reliability engineering (SRE), a model championed by Google, is a software engineering approach to IT operations. For companies striving to become cloud native and adopting modern tools such as Kubernetes, SRE best practices are crucial for success.
In this webinar, Brice, one of our seasoned Customer Reliability Engineers, will show how to design a fail-proof Kubernetes platform using tried-and-tested SRE and GitOps methods.
He will share best practices on:
Increasing performance and ensuring scalability
Managing incident responses through disaster recovery
Designing for High Availability in Kubernetes
Achieving 360 visibility and alerts for your platform
Revolutions have a common pattern in technology and this is no different for the API space. This presentation discusses that pattern and goes through various API revolutions. It also uses Netflix as an example of how some revolutions evolved and where things may be headed.
Netflix Edge Engineering Open House Presentations - June 9, 2016, by Daniel Jacobson
Netflix's Edge Engineering team is responsible for handling all device traffic to support the user experience, including sign-up, discovery, and the triggering of the playback experience. Developing and maintaining this set of massive-scale services is no small task, and its success is the difference between millions of happy streamers and millions of missed opportunities.
This video captures the presentations delivered at the first ever Edge Engineering Open House at Netflix. This video covers the primary aspects of our charter, including the evolution of our API and Playback services as well as building a robust developer experience for the internal consumers of our APIs.
Maintaining the Netflix Front Door - Presentation at Intuit Meetup, by Daniel Jacobson
This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. It also shares several of our open source efforts, such as Zuul, Scryer, Hystrix, RxJava, and the Simian Army.
Most API providers focus on solving all three of the key challenges for APIs: data gathering, data formatting, and data delivery. All three of these functions are critical to the success of an API; however, not all of them should be solved by the API provider. Rather, the API consumers have a strong, vested interest in the formatting and delivery. As a result, API design should be addressed based on the true separation of concerns between the needs of the API provider and those of the various API consumers.
This presentation goes into the separation of concerns. It also goes into depth in how Netflix has solved for this problem through a very different approach to API design.
This presentation was given at the following API Meetup in SF:
http://www.meetup.com/API-Meetup/events/171255242/
Set Your Content Free! : Case Studies from Netflix and NPRDaniel Jacobson
Last Friday (February 8th), I spoke at the Intelligent Content Conference 2013. When Scott Abel (aka The Content Wrangler) first contacted me to speak at the event, he asked me to speak about my content management and distribution experiences from both NPR and Netflix. The two experiences seemed to him to be an interesting blend for the conference. These are the slides from that presentation.
I have applied comments to every slide in this presentation to include the context that I otherwise provided verbally during the talk.
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
Netflix’s unique DVD rental service has revolutionized the industry. They successfully took the best of traditional conventions (like physical media and the U.S. Postal Service) and mixed them with new-world internet conventions. They have also effectively managed to discourage competition from both more established businesses and new entrants. The future growth of Netflix as it expands into streaming media poses challenges in legal, infrastructure/technology, and additional costs. In order to remain competitive, it is imperative that Netflix partner with companies with global reach to overcome these challenges. This presentation was part of an MBA class assignment to audit an industry in the technology sector. The presentation has multiple authors listed on the title page. If you would like copies of the executive summary, complete S.W.O.T. analysis, and/or the transcript of the presentation please PRIVATE MESSAGE ME and I will email it to you.
Slides for the AllianzX Cloud Days in Munich, 01/03, presenting API Management as a complementary component of a modern service architecture. Rather than describing an API Management platform in detail, the talk shows the evolution path of the architecture toward open APIs and a cloud service fabric.
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
APIs for Internal Audiences - Netflix - App Dev ConferenceDaniel Jacobson
API programs, typically thought of as public programs to see what public developer communities can build with a company's data, are becoming more and more critical to the success of mobile and device strategies. This presentation takes a look at the Netflix and NPR strategies that led to tremendous growth and discusses how Netflix plans to take this internal API strategy to the next level.
Scaling the Netflix API - From Atlassian Dev DenDaniel Jacobson
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
Talk about the Netflix API and how it serves as the front door for Netflix device UIs. Topics include: API design, resiliency patterns, scalability, and enabling fast dev/deploy cycles.
Techniques for Scaling the Netflix API - QCon SFDaniel Jacobson
This presentation was from QCon SF 2011. In these slides I discuss various techniques that we use to scale the API. I also discuss in more detail our effort around redesigning the API.
This is a presentation that I gave to ESPN's Digital Media team about the trajectory of the Netflix API. I also discussed Netflix's device implementation strategy and how it enables rapid development and robust A/B testing.
I gave this presentation to the engineering team at PayPal. This presentation discusses the history and future of the Netflix API. It also goes into API design principles as well as concepts behind system scalability and resiliency.
Immutable Infrastructure: Rise of the Machine ImagesC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1WlpXHF.
Axel Fontaine looks at what Immutable Infrastructure is and how it affects scaling, logging, sessions, configuration, service discovery and more. He also looks at how containers and machine images compare and why some things people took for granted may not be necessary anymore. Filmed at qconlondon.com.
Axel Fontaine is the founder and CEO of Boxfuse. Axel is also the creator and project lead of Flyway, the open source tool that makes database migration easy. He is a Continuous Delivery and Immutable Infrastructure expert, a Java Champion, a JavaOne Rockstar and a regular speaker at various large international conferences.
This is my presentation from the Business of APIs Conference in SF, held by Mashery (http://www.apiconference.com).
This talk briefly covers the history of the Netflix API, then goes into three main categories of scaling:
1. Using the cloud to scale in size and internationally
2. Using Webkit to scale application development in parallel to the flexibility afforded by the API
3. Redesigning the API to improve performance and to downscale the infrastructure as the system scales
When viewing these slides, please note that they are almost entirely image-based, so I have added notes for each slide to detail the talking points.
AWS re:Invent 2016: Getting Started with Serverless Architectures (CMP211)Amazon Web Services
Serverless architectures let you build and deploy applications and services with infrastructure resources that require zero administration. In the past, you had to provision and scale servers to run your application code, install and operate distributed databases, and build and run custom software to handle API requests. Now, AWS provides a stack of scalable, fully-managed services that eliminates these operational complexities.
In this session, you learn about the concepts and benefits of serverless architectures and the basics of the serverless stack AWS provides (e.g., AWS Lambda and Amazon API Gateway). We discuss use cases such as data processing, website backends, serverless applications and "operational glue". After that, you get practical tips and tricks, best practices, and architecture patterns that you can take back and implement immediately.
Pros and Cons of a MicroServices Architecture talk at AWS ReInventSudhir Tonse
Netflix morphed from a private datacenter based monolithic application into a cloud based Microservices architecture. This talk highlights the pros and cons of building software applications as suites of independently deployable services, as well as practical approaches for overcoming challenges - especially in the context of an elastic but ephemeral cloud ecosystem. What were the lessons learned while building and managing these services? What are the best practices and anti-patterns?
Yazid Boutejder: AWS San Francisco Startup Day, 9/7/17
Operations: Production Readiness Review – how to stop bad things from happening - There is more to deploying code than pushing the deploy button. A good practice that many companies follow is a Production Readiness Review (PRR) which is essentially a pre-flight check list before a service launches. This helps ensure new services are properly architected, monitored, secured, and more. We’ll walk through an example PRR and discuss the value of ensuring each of these is properly taken care of before your service launches.
Starting Your DevOps Journey – Practical Tips for OpsDynatrace
To watch, please see:
https://info.dynatrace.com/apm_wc_getting_started_with_devops_na_registration.html
Starting Your DevOps Journey: Practical Tips for Ops
In this webinar, Andreas Grabner, Chief DevOps Activist at Dynatrace, shares practical tips that all IT groups from Dev to Ops can use to start their DevOps journey quickly. With experience from hundreds of DevOps deployments, Andi provides insights it would take your team months or years to learn firsthand.
- Learn how everyone on your Ops team can use APM to better understand and monitor SLAs, Performance and End User Impact of their applications.
- Foster better collaboration between Ops and architects by extending basic system monitoring to monolith and microservices architectures.
- Shift-left your testing and QA by working with metrics that you and the architects agreed on up front, resulting in early relevant feedback and faster code deployments.
- Hear why changing the cultural mindset from “fear of change” to “Continuous Innovation and Optimization” is critical for success.
Andi is joined by guest speaker, Brian Chandler, Systems Engineer at Raymond James, who shares commonly used Ops dashboards that increase collaboration across IT teams and pro-actively break down silos!
Petabytes of Data & No Servers: Corteva Scales DNA Analysis to Meet Increasin...Amazon Web Services
Corteva Agriscience, the agricultural division of DowDuPont, produces as much DNA sequence data every six hours as existed in the entire public sphere in 2008. On-premises processing and storage could not scale to meet the business demand. Partnering with Sogeti (part of Capgemini), Corteva replatformed their existing Hadoop-based genome processing systems into AWS using a serverless, cloud-native architecture. In this session, learn how Corteva Agriscience met current and future data processing demands without maintaining any long-running servers by using AWS Lambda, Amazon S3, Amazon API Gateway, Amazon EMR, AWS Glue, AWS Batch, and more. This session is brought to you by AWS partner, Capgemini America.
Christian's part of the AWS re:Invent 2015 talk shared with Sajee Mathew - ARC304 - Designing for SaaS: Next Generation Software Delivery Models on AWS. Full video of the 60 minute presentation: https://www.youtube.com/watch?v=d16aUztH9hk&list=PLhr1KZpdzukdRxs_pGJm-qSy5LayL6W_Y
Similar to Maintaining the Front Door to Netflix : The Netflix API (20)
Many API programs get launched without a clear understanding as to WHY the API should exist. Rather, many are focused on WHAT the API consists of and HOW it should be targeted, implemented and leveraged. This presentation focuses on establishing the need for a clear WHY proposition behind the decision. The HOW and then WHAT will follow from that.
This presentation also uses the history of the Netflix API to demonstrate the power, utility and importance of knowing WHY you are building an API.
Netflix API: Keynote at Disney Tech ConferenceDaniel Jacobson
Disney held the first in a series of internal technical conferences in Orlando, FL, this one focused entirely on APIs. These slides are from my keynote presentation which kicked off the event. The slides focus on the Netflix API, API design, anti-patterns, technical revolutions, resiliency, scaling, test frameworks and other constructs that support the Netflix infrastructure.
History and Future of the Netflix API - Mashery Evolution of DistributionDaniel Jacobson
Presentation on the history and future of the Netflix API. This presentation walks through how the API was formed, why it needs a redesign and some of the principles that will be applied in the redesign effort.
This presentation was given at the Mashery Evolution of Distribution session in San Francisco on June 2, 2011.
This presentation demonstrates the great successes of the Netflix API to date. After some introspection, however, there is an opportunity to better prepare the API for the future. This presentation also offers a few ideas on how the Netflix API architecture may change over time.
NPR: Digital Distribution Strategy: OSCON2010Daniel Jacobson
When launching the API at OSCON in 2008, NPR targeted four audiences: the open source community; NPR member stations; NPR partners and vendors; and finally our internal developers and product managers. In its short two-year life, the NPR API has grown tremendously, from only a few hundred thousand requests per month to more than 60M. The API, furthermore, has enabled tremendous growth for NPR in the mobile space while facilitating more than 100% growth in total page views in the last year.
NPR's Digital Distribution and Mobile StrategyDaniel Jacobson
The NPR API has been the great enabler to achieve rapid development in the mobile space. That is, because we have our rich and powerful API, our mobile team is free to pursue the development of their mobile products without being encumbered by limited internal development resources. The touch-point between the mobile product and our content is fixed which means the mobile team can focus on design and usability for the specific platform.
These slides demonstrate some of the usage and metrics of the NPR API. In addition to the flow of an NPR story from creation to distribution, I also tried to provide a reasonable sampling of the more popular or interesting implementations.
These slides are from the OpenID UX Summit at Sears in Chicago. We discuss the newly formed Adoption Committee for OpenID, NPR's identity sharing strategy, Sears' OpenID case study, PBS' case study, and the goal towards a federated public media identity.
This presentation shows the same NPR story displayed in a wide range of platforms. The content, through the principles of COPE, is pushed out to all of these destinations through the NPR API. Each destination, meanwhile, uses the appropriate content for that presentation layer.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides from my and Rik Marselis's talk at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop with the participants, exploring different ways to think about quality and testing in the various parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The keynote covers the key trends across hardware, cloud and open source, exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Maintaining the Front Door to Netflix : The Netflix API
1. Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
2. There are copious notes attached to each slide in this presentation.
Please read those notes to get the full context of the presentation.
4. More than 44 Million Subscribers
More than 40 Countries
5. Netflix Accounts for ~33% of Peak
Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
8. Team Focus:
Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming
9. Key Responsibilities
• Broker data between services and UIs
• Maintain a resilient front-door
• Scale the system vertically and horizontally
• Maintain high velocity
60. Amazon Auto Scaling Limitations
• Hard to fit policies to variable traffic patterns (weekday vs. weekend)
• Limited control over capacity adjustments (absolute value or %)
61. The Impact of AAS Limitations
• Traffic drop can lead to scale-downs during an outage
• Performance degradation between new instance launch and taking traffic
• Excess capacity at peak and trough
66. What is Scryer Doing?
• Evaluating needs based on historical data – week-over-week, month-over-month metrics
• Adjusts instance minimums based on algorithms
• Relies on Amazon Auto Scaling for unpredicted events
99. Single Canary Instance
To Test New Code with Production Traffic (around 1% or less of traffic)
Current Code In Production
API Requests from the Internet
101. Single Canary Instance
To Test New Code with Production Traffic (around 1% or less of traffic)
Current Code In Production
API Requests from the Internet
Error!
138. Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Editor's Notes
Netflix strives to be the global streaming video leader for TV shows and movies
We now have more than 44 million global subscribers in more than 40 countries
Those subscribers consume more than a billion hours of streaming video a month which accounts for about 33% of the peak Internet traffic in the US.
Our 44 million Netflix subscribers are watching shows and movies on virtually any device that has a streaming video screen. We are now on more than 1,000 different device types.
The subscribers can watch our original shows like Emmy-winning House of Cards.
Within this world, the Edge Engineering team focuses on these three aspects of the streaming product.
Before all of this streaming success, however, our roots were in supporting the DVD business.
The system that ran that business was a large monolithic application that was developed and maintained by many teams.
And that monolithic application ran in data centers that we needed to scale and maintain directly.
And as we all know, the bigger the ship, the slower it turns. That was very much the case for Netflix years ago.
To grow to where we knew we needed to be, Netflix aggressively moved to a distributed architecture.
I like to think of this distributed architecture as being shaped like an hourglass…
In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
The API is at the center of the hourglass, acting as a broker of data.
Our distributed architecture, with the number of systems involved, can get quite complicated. Each of these systems talks to a large number of other systems within our architecture.
Assuming each of the services has an SLA of four nines, that results in more than two hours of downtime per month.
And that is if all services maintain four nines!
If availability degrades to three nines, that is almost one day of downtime per month!
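The arithmetic behind these numbers can be sketched as follows. The count of 30 dependencies is an illustrative assumption (the deck only says "several dozen" dependency teams), not a stated Netflix figure:

```python
# If an API fans out to N dependencies, its best-case availability is the
# product of theirs. N = 30 is an assumption, chosen to match "several dozen".

HOURS_PER_MONTH = 730  # average month


def downtime_hours_per_month(per_service_availability: float, n_services: int) -> float:
    """Compound availability across n serial dependencies, expressed as downtime."""
    overall = per_service_availability ** n_services
    return (1 - overall) * HOURS_PER_MONTH


four_nines = downtime_hours_per_month(0.9999, 30)   # ~2.2 hours/month
three_nines = downtime_hours_per_month(0.999, 30)   # ~21.6 hours/month, almost a day
print(f"four nines across 30 services: {four_nines:.1f} h/month")
print(f"three nines across 30 services: {three_nines:.1f} h/month")
```

With these assumptions, four nines per service compounds to roughly 2.2 hours of monthly downtime, and three nines to almost a full day, matching the slide's claims.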
So, back to the hourglass…
In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
Such a failure could have resulted in an outage in the API.
And that outage likely would have cascaded to have some kind of substantive impact on the devices.
The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems and to keep them happy.
To solve this problem, we created Hystrix, a wrapping technology that provides fault tolerance in a distributed environment. Hystrix is also open source and available in our GitHub repository.
To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
This is a view of a single circuit.
This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
The blue line represents the traffic trends over the last two minutes for this dependency.
The green number shows the number of successful calls to this dependency over the last two minutes.
The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
The orange number shows the number of calls that have timed out, resulting in fallback responses.
The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
The red number shows the number of exceptions, resulting in fallback responses.
The error rate is calculated from the total number of error and fallback responses divided by the total number of calls handled.
If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
The dashboard also shows host and cluster information for the dependency.
As well as information about our SLAs.
So, going back to the engineering diagram…
If that same service fails today…
We simply disconnect from that service.
And replace it with an appropriate fallback. The fallback is ideally a slightly degraded but still useful offering. If we cannot provide that, however, we will quickly return a 5xx response, which helps the system shed load rather than queue requests (which could eventually cause the system as a whole to tip over).
This will keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
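The circuit-breaker-plus-fallback behavior described above can be sketched in a few lines. This is the idea behind Hystrix, not Hystrix itself; the threshold, window size, and names are illustrative:

```python
# Track the error rate over a rolling window; past a threshold, open the
# circuit and short-circuit straight to the fallback without touching the
# dependency.
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, window_size=100):
        self.error_threshold = error_threshold
        self.outcomes = deque(maxlen=window_size)  # True = error/fallback

    def error_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def call(self, dependency, fallback):
        if self.error_rate() > self.error_threshold:
            return fallback()           # circuit open: don't touch the dependency
        try:
            result = dependency()
            self.outcomes.append(False)
            return result
        except Exception:
            self.outcomes.append(True)  # counts toward the error rate
            return fallback()


def flaky_dependency():
    raise RuntimeError("dependency down")


def degraded_but_useful():
    return "fallback response"


breaker = CircuitBreaker()
for _ in range(10):
    breaker.call(flaky_dependency, degraded_but_useful)
print(breaker.error_rate())  # 1.0 -- the circuit is open, callers get fallbacks
```

A real breaker such as Hystrix also re-closes the circuit once the error rate recovers, typically by letting occasional probe requests through (a half-open state); that is omitted here for brevity.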
In addition to the migration to a distributed architecture, we also aggressively moved out of data centers…
And into the cloud.
Instead of spending time in data centers, we spend our time in tools such as Asgard, created by Netflix staff, to help us manage our instance types and counts in AWS. Asgard is available in our open source repository on GitHub.
Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
Through autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix).
Instead of reacting to real-time metrics, like load average, to increase/decrease the instance count, we can look at historical patterns in our traffic to figure out what will be needed BEFORE it is needed. We believed we could write algorithms to predict the needs.
This is the result of the algorithms we created for the predictions. The prediction closely matches the actual traffic.
Based on those predictions, we started triggering scaling events. Those scaling events closely matched the traffic patterns as well, although these scaling events (as opposed to the Amazon auto-scaler) preceded the need.
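The predict-then-provision idea can be sketched as below. Scryer is not open source, so this is not its algorithm; averaging the same hour-of-week across recent weeks, the requests-per-instance figure, and the headroom factor are all illustrative assumptions:

```python
# Predict each upcoming hour's traffic from the same hour in previous weeks,
# then set the instance-count minimum ahead of need. Reactive autoscaling
# (Amazon Auto Scaling) still covers unpredicted spikes on top.
import math


def predict_rps(history_weeks: list, hour: int) -> float:
    """Average the same hour-of-week across recent weeks."""
    samples = [week[hour] for week in history_weeks]
    return sum(samples) / len(samples)


def min_instances(predicted_rps: float, rps_per_instance: float = 1000.0,
                  headroom: float = 1.2) -> int:
    """Provision for the prediction plus headroom."""
    return math.ceil(predicted_rps * headroom / rps_per_instance)


# Three weeks of hourly RPS for the same hour (e.g. Monday 20:00):
history = [[52_000.0], [55_000.0], [58_000.0]]
pred = predict_rps(history, hour=0)        # 55,000 rps
print(min_instances(pred))                 # ceil(55,000 * 1.2 / 1,000) = 66
```

The key property is the one the notes describe: the minimum is raised *before* the predicted traffic arrives, rather than in reaction to load averages after the fact.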
Load average when running Scryer is much smoother.
Smoother load average results in more consistent and faster response times.
Another benefit is how Scryer protects us from outage recovery.
Retries and thundering herd effects are mitigated by consistent provisioning of instances through Scryer.
Meanwhile, because we have more predictable scaling needs that can be provisioned more granularly, our AWS costs go down.
Going global has a different set of scaling challenges. AWS enables us to add instances in new regions that are closer to our customers.
To help us manage our traffic across regions, as well as within given regions, we created Zuul. Zuul is open source in our github repository.
Zuul does a variety of things for us. Zuul fronts our entire streaming application as well as a range of other services within our system.
Moreover, Zuul is the routing engine that we use for Isthmus, which is designed to marshal traffic between regions, for failover, performance, or other reasons.
We also leverage multiple regions to help us fail over one region to another, through an effort we call Active-Active.
Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
The army is the Simian Army, which is a fleet of monkeys who are designed to do a variety of things, in an automated way, in our cloud implementation. Chaos Monkey, for example, periodically terminates AWS instances in production to see how the system as a whole will respond once that server disappears. Latency Monkey introduces latencies and errors into a system to see how it responds. The system is too complex to know how things will respond in various circumstances, so the monkeys expose that information to us in a variety of ways. The monkeys are also available in our open source github repository.
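The core Chaos Monkey move can be sketched as below. The `Cluster` class and `terminate()` call are hypothetical stand-ins for illustration, not the Simian Army's actual API:

```python
# On a schedule, pick one instance per cluster at random and terminate it,
# verifying the system as a whole tolerates the loss.
import random


class Cluster:
    def __init__(self, name, instances):
        self.name = name
        self.instances = list(instances)

    def terminate(self, instance):
        # A real implementation would issue an AWS instance-termination call.
        self.instances.remove(instance)


def chaos_monkey(cluster, rng=random):
    """Terminate one randomly chosen instance, but never the last one."""
    if len(cluster.instances) <= 1:
        return None
    victim = rng.choice(cluster.instances)
    cluster.terminate(victim)
    return victim


api = Cluster("api-prod", ["i-01", "i-02", "i-03", "i-04"])
victim = chaos_monkey(api)
print(f"terminated {victim}; survivors: {api.instances}")
```

Latency Monkey works on the same principle but injects delays and errors into calls rather than killing instances.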
Again, the dependency chains in our system are quite complicated.
That is a lot of change in the system!
As a result, our philosophy is to act fast (i.e., get code into production as quickly as possible), then react fast (i.e., respond to issues quickly as they arise).
Two such examples are canary deployments and what we call red/black deployments.
The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
The health of the canary is automated as well, comparing its metrics against the fleet of production servers.
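That automated comparison can be sketched as follows. This is a hypothetical illustration of the idea, not Netflix's actual canary-analysis tooling; the metric names and the tolerance value are made up.

```python
def canary_healthy(canary_metrics, fleet_baseline, tolerance=0.2):
    """Flag the canary unhealthy if any metric deviates from the
    fleet average by more than `tolerance` (relative deviation).

    canary_metrics: dict of metric name -> canary's observed value
    fleet_baseline: dict of metric name -> fleet-wide average
    """
    for metric, base_value in fleet_baseline.items():
        if base_value == 0:
            continue  # avoid dividing by zero for absent baselines
        deviation = abs(canary_metrics[metric] - base_value) / base_value
        if deviation > tolerance:
            return False
    return True
```

A real system would aggregate many such comparisons into a score over time, but the essence is the same: the fleet defines "normal," and the canary is judged against it.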
If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc.
We will then repeat the process until the analysis of the canary servers looks good.
We also use Zuul to funnel varying degrees of traffic to the canaries to evaluate how much load the canary can take relative to the current production instances. If the RPS, for example, drops, the canary may fail the Zuul stress test.
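Funneling a percentage of traffic to the canary amounts to a weighted routing decision. The sketch below is a hypothetical simplification, not an actual Zuul filter; the pool names and the percentage-based rule are illustrative.

```python
import random

def route(canary_percent, rng=random.random):
    """Send roughly `canary_percent` percent of requests to the canary
    pool and the rest to the production pool."""
    return "canary" if rng() * 100 < canary_percent else "prod"
```

Dialing `canary_percent` up gradually is what lets us observe how the canary behaves as load approaches production levels.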
If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
Then switch the pointer to have external requests point to the black servers. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
If a problem is encountered from the black servers, it is easy to rollback quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
And we will stress the canary again…
If the new code looks good in the canary, we can then bring up another set of servers with the new code.
Then we will switch production traffic to the new code.
If everything still looks good, we disable the red servers and the new code becomes the new red servers.
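The red/black mechanism boils down to a traffic pointer that can be switched and, crucially, switched back. This is a minimal sketch of that idea, with hypothetical cluster names; the real switch would happen at the load-balancer or DNS level.

```python
class TrafficPointer:
    """Points external traffic at exactly one cluster at a time.
    Rollback is just pointing back at the previous cluster."""

    def __init__(self, active):
        self.active = active      # cluster currently serving traffic
        self.previous = None      # cluster we can roll back to

    def switch(self, new_cluster):
        """Cut traffic over to a freshly launched cluster."""
        self.previous = self.active
        self.active = new_cluster

    def rollback(self):
        """Undo the last switch, e.g. when the new cluster misbehaves."""
        if self.previous is not None:
            self.active, self.previous = self.previous, self.active
```

Because the old cluster stays up until the new one is proven, rollback is a pointer flip rather than a redeploy, which is what makes reacting fast possible.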
Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis?
Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
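One way to picture per-device optimization is a response shaped by a device profile, so a phone receives fewer titles and fewer fields than a TV. Everything here is hypothetical for illustration: the profiles, limits, and field names are not Netflix's actual schema.

```python
# Hypothetical device profiles: how many titles to send, and which
# metadata fields each device can actually use.
DEVICE_PROFILES = {
    "tv":    {"max_titles": 75, "fields": {"title", "synopsis", "cast", "boxart_hd"}},
    "phone": {"max_titles": 20, "fields": {"title", "boxart_sd"}},
}

def shape_response(titles, device):
    """Trim a list of title records down to what this device needs,
    instead of shipping everything and letting the device drop it."""
    profile = DEVICE_PROFILES[device]
    return [
        {k: v for k, v in t.items() if k in profile["fields"]}
        for t in titles[:profile["max_titles"]]
    ]
```

The alternative, sending every field to every device, wastes bandwidth on constrained devices and forces the client to discard data it could never display.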
Many UI teams needing metadata means many requests to the API team. In the one-size-fits-all API world, we essentially needed to funnel these requests and then prioritize them. That means that some teams would need to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints, resulting in a more complex system as well as a lot of spaghetti code. Making teams wait due to prioritization was exacerbated by the fact that tasks took longer as technical debt increased, causing build and test times to grow. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
Many other companies have seen similar issues and have introduced orchestration layers that enable more flexible interaction models.
OData, HYQL, ql.io, rest.li and others are examples of orchestration layers. They address the same problems that we have seen, but we approached the solution in a very different way.
We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
In this mode, there is a very clear divide between the Client Code and the Server Code. That divide is the network border.
And the responsibilities have the same distribution as well. The Client Code handles the rendering of the interface (as well as asking the server for data). The Server Code is responsible for gathering, formatting and delivering the data to the UIs.
And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
In an experience-based interaction, the PS3 can potentially make a single request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script, which then formats and delivers the very specific data back to the PS3.
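The shape of that interaction can be sketched as follows, with Python standing in for both the Groovy adapter and the Java API. The function names, the resources, and the response structure are all hypothetical illustrations, not the actual Netflix interfaces.

```python
def java_api(resource):
    """Stand-in for the shared Java API methods that gather data from
    dependent services (here just canned data for illustration)."""
    data = {
        "queue": ["Title A", "Title B"],
        "recommendations": ["Title C"],
    }
    return data[resource]

def ps3_home_endpoint():
    """Stand-in for a device-specific Groovy script: one endpoint that
    gathers everything the PS3 home screen needs, then formats the
    payload exactly as that device wants it."""
    return {
        "rows": [
            {"name": "Your Queue", "titles": java_api("queue")},
            {"name": "Suggestions", "titles": java_api("recommendations")},
        ]
    }
```

The key property is that the device makes one request and the per-device assembly logic lives server-side, written by the client team, next to the shared data-gathering methods.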
We also introduced RxJava into this layer to improve our ability to handle concurrency and callbacks. RxJava is open source in our github repository.
In this model, the border between Client Code and Server Code is no longer the network border. It is now back on the server. The Groovy scripts are essentially client adapters written by the client teams.
And the distribution of work changes as well. The client teams continue to handle UI rendering, but now are also responsible for the formatting and delivery of content. The API team, in terms of the data side of things, is responsible for the data gathering and hand-off to the client adapters. Of course, the API team does many other things, including resiliency, scaling, dependency interactions, etc. This model is essentially a platform for API development.
If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
All of the open source components discussed here, as well as many others, can be found at the Netflix github repository.