Maintaining the Front Door to Netflix : The Netflix API


This presentation was given to the engineering organization at Zendesk. In it, I talk about the challenges that the Netflix API faces in supporting 1,000+ different device types, millions of users, and billions of transactions. The topics include resiliency, scale, API design, failure injection, continuous delivery, and more.


Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson

There are copious notes attached to each slide in this presentation.
Please read those notes to get the full context of the presentation.

Global Streaming Video
for TV Shows and Movies

More than 44 Million Subscribers
More than 40 Countries

Netflix Accounts for ~33% of Peak
Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month

Team Focus:
Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming

Key Responsibilities
• Broker data between services and UIs
• Maintain a resilient front-door
• Scale the system vertically and horizontally
• Maintain high velocity

Monolithic Application
In Netflix Data Centers

The bigger the ship…
the slower it turns

Dozens of Dependencies
[Diagram: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]

[Diagram: API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]

2,000,000,000
Requests Per Day to the
Netflix API

30
Distinct Dependent
Services for the Netflix API

~500
Dependency jars Slurped
into the Netflix API

14,000,000,000
Netflix API Calls Per Day to
those Dependent Services

99.99%^30 = 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime Per Month

99.9%^30 = 97%
3% of 2B = 60M failures per day
20+ Hours of Downtime Per Month
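The arithmetic on the two slides above can be reproduced with a short sketch. It assumes that the 30 dependencies fail independently and that every request needs all of them; the function names and the 30-day month are illustrative choices, not something the slides specify.

```python
def compound_availability(per_service: float, dependencies: int) -> float:
    """Availability of a request that needs every dependency to succeed,
    assuming the dependencies fail independently of each other."""
    return per_service ** dependencies


def daily_failures(availability: float, requests_per_day: int) -> float:
    """Expected number of failed requests per day."""
    return (1.0 - availability) * requests_per_day


def monthly_downtime_hours(availability: float, hours_per_month: float = 30 * 24) -> float:
    """Equivalent downtime per month (a 30-day month is assumed)."""
    return (1.0 - availability) * hours_per_month


if __name__ == "__main__":
    for per_service in (0.9999, 0.999):
        total = compound_availability(per_service, 30)
        print(
            f"{per_service:.2%} per service -> {total:.1%} overall, "
            f"{daily_failures(total, 2_000_000_000) / 1e6:.0f}M failures/day, "
            f"{monthly_downtime_hours(total):.1f} h downtime/month"
        )
```

The slides round the overall availability before multiplying, so the unrounded figures for the 99.9% case come out slightly below 60M failures per day.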

Call Volume and Health / Last 10 Seconds

Short-Circuited Requests, Delivering Fallbacks

Thread Pool & Task Queue Full, Delivering Fallbacks

Error Rate
(# + # + # + #) / (# + # + # + # + #) = Error Rate

Requests per Second, Over Last 10 Seconds
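The dashboard labels above (error rate over the last 10 seconds, short-circuited requests, fallbacks) describe a circuit breaker. The following is a minimal sketch of that idea, not the implementation Netflix uses: the class name, the thresholds, and the cool-down behavior are all assumptions, and thread-pool isolation is left out entirely.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy circuit breaker: trips when the error rate over a rolling window
    exceeds a threshold, then serves fallbacks until a cool-down passes."""

    def __init__(
        self,
        window_seconds: float = 10.0,
        error_threshold: float = 0.5,
        min_requests: int = 20,
        cooldown_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self.error_threshold = error_threshold
        self.min_requests = min_requests
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)
        self._opened_at: Optional[float] = None

    def _trim(self, now: float) -> None:
        # Drop events that have fallen out of the rolling window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def error_rate(self) -> float:
        """Failures divided by all requests in the current window."""
        self._trim(self.clock())
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self.clock() - self._opened_at >= self.cooldown_seconds:
            # Cool-down over: close the circuit and start with a clean window.
            self._opened_at = None
            self._events.clear()
            return False
        return True

    def _record(self, ok: bool) -> None:
        now = self.clock()
        self._events.append((now, ok))
        self._trim(now)
        if (
            len(self._events) >= self.min_requests
            and self.error_rate() > self.error_threshold
        ):
            self._opened_at = now  # trip the circuit

    def call(self, command: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuited: the dependency is not called
        try:
            result = command()
        except Exception:
            self._record(False)
            return fallback()
        self._record(True)
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical function that talks to one backend service.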

[Diagram: API and its dependencies (Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine), with a Fallback]

Amazon Auto Scaling Limitations
• Hard to fit policies to variable traffic patterns (weekday vs weekend)
• Limited control over capacity adjustments (absolute value or %)

The Impact of AAS Limitations
• Traffic drop can lead to scale downs during outage
• Performance degradation between new instance launch and taking traffic
• Excess capacity at peak and trough
Scryer : Predictive Auto Scaling
Not yet…

What is Scryer Doing?
• Evaluating needs based on historical data
– Week over week, month over month metrics
• Adjusts instance minimums based on algorithms
• Relies on Amazon Auto Scaling for unpredicted events
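The bullets above can be illustrated with a small sketch: forecast the load for an upcoming hour from the same hour in earlier weeks, then turn that forecast into an instance minimum. The slides do not describe Scryer's actual algorithms, so the simple average, the headroom factor, and the per-instance capacity below are all assumptions made for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def predict_rps(same_hour_previous_weeks: Sequence[float]) -> float:
    """Naive forecast: average of the request rate seen at the same
    hour-of-week in earlier weeks."""
    if not same_hour_previous_weeks:
        raise ValueError("need at least one historical sample")
    return mean(same_hour_previous_weeks)


def instance_minimum(
    predicted_rps: float,
    rps_per_instance: float,
    headroom: float = 0.2,
    floor: int = 1,
) -> int:
    """Instance minimum to set ahead of time. Reactive auto scaling can
    still add capacity above this value for unpredicted events."""
    needed = predicted_rps * (1.0 + headroom) / rps_per_instance
    return max(floor, math.ceil(needed))
```

Setting only the minimum, rather than the exact instance count, leaves the reactive policy in place as a safety net, which matches the last bullet above.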

Results : Load Average
Reactive
Predictive

Results : Response Latencies
Reactive
Predictive

Zuul
Gatekeeper for the Netflix Streaming Application

Zuul *
• Multi-Region Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response Handling
• Authentication
* Most closely resembles an API proxy

All of these approaches are
designed to prevent failures…

But sometimes the best way to
prevent failures is to force them!

Chaos Monkey
I randomly terminate instances in production to identify dormant failures.
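The idea of randomly terminating production instances can be sketched in a few lines. This is not Netflix's tool: the function name, the grouping of instances, and the `terminate` hook are hypothetical, and no real cloud API is called.

```python
import random
from typing import Callable, Dict, List, Optional


def unleash_chaos(
    groups: Dict[str, List[str]],
    terminate: Callable[[str], None],
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """For each group, with the given probability, terminate one randomly
    chosen instance. Returns the IDs that were terminated."""
    rng = rng or random.Random()
    victims: List[str] = []
    for instances in groups.values():
        if not instances or rng.random() >= probability:
            continue
        victim = rng.choice(instances)
        terminate(victim)  # hypothetical hook; a real tool would call the cloud API here
        victims.append(victim)
    return victims
```

Passing the `terminate` function in from outside keeps the selection logic testable without touching any infrastructure.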

Chaos Gorilla
I simulate an outage of an entire Amazon availability zone.

Chaos Kong
I simulate an outage in an AWS region.

Conformity Monkey
I find instances that don’t adhere to best practices.

Security Monkey
I extend Conformity Monkey to find security violations.

Doctor Monkey
I detect unhealthy instances and remove them from service.

Janitor Monkey
I clean up the clutter and waste that runs in the cloud.

Latency Monkey
I induce artificial delays and errors into services to determine how upstream services will respond.
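Injecting artificial delays and errors can be sketched as a wrapper around a dependency call. The names `with_faults` and `InjectedFault` and the probability parameters are hypothetical; the slides do not say how the real tool injects faults.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency error."""


def with_faults(
    call: Callable[[], T],
    delay_seconds: float = 0.0,
    delay_probability: float = 0.0,
    error_probability: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap a call so that it is sometimes delayed and sometimes fails."""
    rng = rng or random.Random()

    def wrapped() -> T:
        if rng.random() < delay_probability:
            sleep(delay_seconds)  # artificial latency
        if rng.random() < error_probability:
            raise InjectedFault("injected failure")  # artificial error
        return call()

    return wrapped
```

Wrapping a call this way in a test environment shows whether the callers time out, retry, or fall back as intended.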

Testing Philosophy:
Act Fast, React Fast

Current Code
In Production
API Requests from
the Internet

Single Canary Instance
To Test New Code with Production Traffic
(around 1% or less of traffic)
Current Code
In Production
API Requests from
the Internet
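The canary step above amounts to sending a small share of production traffic to one instance running the new code and comparing its behavior with the current code. A minimal sketch, assuming a random split and a simple error-rate comparison (both function names and the tolerance value are hypothetical):

```python
import random
from typing import Optional

_DEFAULT_RNG = random.Random()


def pick_target(
    canary_fraction: float = 0.01,
    rng: Optional[random.Random] = None,
) -> str:
    """Send roughly `canary_fraction` of requests to the canary instance
    and the rest to the current production code."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    rng = rng or _DEFAULT_RNG
    return "canary" if rng.random() < canary_fraction else "production"


def canary_looks_healthy(
    canary_error_rate: float,
    baseline_error_rate: float,
    tolerance: float = 0.005,
) -> bool:
    """Accept the canary if its error rate is not noticeably worse than
    the baseline (the tolerance is an illustrative value)."""
    return canary_error_rate <= baseline_error_rate + tolerance
```

In practice the split would more likely be made by a load balancer or proxy than by application code; the sketch only shows the proportions involved.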

Current Code
In Production
API Requests from
the Internet
Perfect!

Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production

Error!
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production

API Requests from
the Internet
New Code
Getting Prepared for Production

One-Size-Fits-All API
[Diagram: Request, Request, Request]

Courtesy of South Florida Classical Review

Resource-Based API
vs.
Experience-Based API

Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people

REST API
[Diagram: REST API, network borders, and the services Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, Ratings]

OSFA API
[Diagram: OSFA API, network borders, and the services Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, Ratings]
SERVER CODE
CLIENT CODE

OSFA API
[Diagram: OSFA API, network borders, and the services Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, Ratings]
DATA GATHERING, FORMATTING, AND DELIVERY
USER INTERFACE RENDERING

Experience-Based Requests
• /ps3/homescreen
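The contrast with the resource-based requests can be sketched as follows: instead of the device calling several resource endpoints and assembling the result itself, one device-specific endpoint gathers and shapes the data on the server. The service functions, the payload shape, and the routing table are hypothetical stand-ins, and Python is used here purely for illustration.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for backend services; the data is made up.
def get_recommendations(user_id: str) -> List[str]:
    return ["title-1", "title-2"]


def get_ratings(user_id: str) -> Dict[str, int]:
    return {"title-1": 5}


def get_instant_queue(user_id: str) -> List[str]:
    return ["title-3"]


def ps3_homescreen(user_id: str) -> dict:
    """Device-specific adapter: gathers data from several services on the
    server and returns one payload shaped for a single screen."""
    return {
        "rows": [
            {"name": "Recommended", "titles": get_recommendations(user_id)},
            {"name": "Instant Queue", "titles": get_instant_queue(user_id)},
        ],
        "ratings": get_ratings(user_id),
    }


# One route per experience instead of one route per resource.
ROUTES: Dict[str, Callable[[str], dict]] = {"/ps3/homescreen": ps3_homescreen}


def handle(path: str, user_id: str) -> dict:
    """Dispatch a request path to the adapter registered for it."""
    return ROUTES[path](user_id)
```

With this shape the device makes a single request across the network border, and the fan-out to the individual services happens server-side.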

JAVA API
[Diagram: Java API, network borders, and the services Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, Ratings]
Groovy Layer

JAVA API
[Diagram: Java API, network borders, and the services Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, Ratings]
SERVER CODE
CLIENT CODE
CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER)

JAVA API
[Diagram: Java API, network borders, and the services Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, Ratings]
DATA GATHERING
DATA FORMATTING AND DELIVERY
USER INTERFACE RENDERING


Finding Java's Hidden Performance Traps @ DevoxxUK 2024

Victor Rentea

💥 You’re lucky! We’ve found two different (lead) developers that are willing to share their valuable lessons learned about using UiPath Document Understanding! Based on recent implementations in appealing use cases at Partou and SPIE. Don’t expect fancy videos or slide decks, but real and practical experiences that will help you with your own implementations. 📕 Topics that will be addressed: • Training the ML-model by humans: do or don't? • Rule-based versus AI extractors • Tips for finding use cases • How to start 👨‍🏫👨‍💻 Speakers: o Dion Morskieft, RPA Product Owner @Partou o Jack Klein-Schiphorst, Automation Developer @Tacstone Technology

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam

UiPathCommunity

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024

Introduction to use of FHIR Documents in ABDM

CNIC Information System with Pakdata Cf In Pakistan

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

AWS Community Day CPH - Three problems of Terraform

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024

AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

Six Myths about Ontologies: The Basics of Formal Ontology

Introduction to Multilingual Retrieval Augmented Generation (RAG)

FWD Group - Insurer Innovation Award 2024

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Vector Search -An Introduction in Oracle Database 23ai.pptx

How to Troubleshoot Apps for the Modern Connected Worker

JohnPollard-hybrid-app-RailsConf2024.pptx

Finding Java's Hidden Performance Traps @ DevoxxUK 2024

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam

Maintaining the Front Door to Netflix : The Netflix API

1. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson

2. There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation

3. Global Streaming Video for TV Shows and Movies

4. More than 44 Million Subscribers More than 40 Countries

5. Netflix Accounts for ~33% of Peak Internet Traffic in North America Netflix subscribers are watching more than 1 billion hours a month

8. Team Focus: Build the Best Global Streaming Product Three aspects of the Streaming Product: • Non-Member • Discovery • Streaming

9. Key Responsibilities • Broker data between services and UIs • Maintain a resilient front-door • Scale the system vertically and horizontally • Maintain high velocity

10. But Before Streaming…

11.

12.

13. Monolithic Application In Netflix Data Centers

14. The bigger the ship… the slower it turns

15. Distributed Architecture

16.

17. 1000+ Device Types

18. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies Reviews A/B Test Engine Dozens of Dependencies

19. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

20. Dependency Relationships

21. 2,000,000,000 Requests Per Day to the Netflix API

22. 30 Distinct Dependent Services for the Netflix API

23. ~500 Dependency jars Slurped into the Netflix API

24. 14,000,000,000 Netflix API Calls Per Day to those Dependent Services

25. 0 Dependent Services with 100% SLA

26. 99.99%^30 = 99.7%; 0.3% of 2B = 6M failures per day; 2+ Hours of Downtime Per Month

27. 99.99%^30 = 99.7%; 0.3% of 2B = 6M failures per day; 2+ Hours of Downtime Per Month

28. 99.9%^30 = 97%; 3% of 2B = 60M failures per day; 20+ Hours of Downtime Per Month

29. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

30. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

31. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

32. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

33. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

34.

35. Circuit Breaker Dashboard

36.

37. Call Volume and Health / Last 10 Seconds

38. Call Volume / Last 2 Minutes

39. Successful Requests

40. Successful, But Slower Than Expected

41. Short-Circuited Requests, Delivering Fallbacks

42. Timeouts, Delivering Fallbacks

43. Thread Pool & Task Queue Full, Delivering Fallbacks

44. Exceptions, Delivering Fallbacks

45. Error Rate: (# + # + # + #) / (# + # + # + # + #) = Error Rate

46. Status of Fallback Circuit

47. Requests per Second, Over Last 10 Seconds

48. SLA Information

49. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

50. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

51. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine

52. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback

53. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback

54. Scaling the Distributed System

55.

56. AWS Cloud

57.

58. Autoscaling

59. Autoscaling

60. Amazon Auto Scaling Limitations • Hard to fit policies to variable traffic patterns (weekday vs weekend) • Limited control over capacity adjustments (absolute value or %)

61. The Impact of AAS Limitations • Traffic drop can lead to scale downs during outage • Performance degradation between new instance launch and taking traffic • Excess capacity at peak and trough

62. Scryer : Predictive Auto Scaling Not yet…

63. Typical Traffic Patterns Over Five Days

64. Predicted RPS Compared to Actual RPS

65. Scaling Plan for Predicted Workload

66. What is Scryer Doing? • Evaluating needs based on historical data – Week over week, month over month metrics • Adjusts instance minimums based on algorithms • Relies on Amazon Auto Scaling for unpredicted events

67. Results

68. Results : Load Average Reactive Predictive

69. Results : Response Latencies Reactive Predictive

70. Results : Outage Recovery

71. Results : Outage Recovery

72. Results : AWS Costs

73. Scaling Globally

74. More than 44 Million Subscribers More than 40 Countries

75. Zuul Gatekeeper for the Netflix Streaming Application

76. Zuul * • Multi-Region Resiliency • Insights • Stress Testing • Canary Testing • Dynamic Routing • Load Shedding • Security • Static Response Handling • Authentication * Most closely resembles an API proxy

77. Isthmus

78.

79. All of these approaches are designed to prevent failures…

80. But sometimes the best way to prevent failures is to force them!

81.

82. I randomly terminate instances in production to identify dormant failures. Chaos Monkey

83. Chaos Gorilla I simulate an outage of an entire Amazon availability zone.

84. I simulate an outage in an AWS region. Chaos Kong

85. I find instances that don’t adhere to best practices. Conformity Monkey

86. I extend Conformity Monkey to find security violations. Security Monkey

87. I detect unhealthy instances and remove them from service. Doctor Monkey

88. I clean up the clutter and waste that runs in the cloud. Janitor Monkey

89. I induce artificial delays and errors into services to determine how upstream services will respond. Latency Monkey

90.

91. Deployments in the Cloud

92. Dependency Relationships

93.

94. Testing Philosophy: Act Fast, React Fast

95. That Doesn’t Mean We Don’t Test

96. Automated Delivery Pipeline

97. Cloud-Based Deployment Techniques

98. Current Code In Production API Requests from the Internet

99. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet

100. Canary Analysis Automation

101. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!

102. Current Code In Production API Requests from the Internet

103. Current Code In Production API Requests from the Internet

104. Current Code In Production API Requests from the Internet Perfect!

105. Stress Test with Zuul

106. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

107. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

108. Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

109. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

110. Current Code In Production API Requests from the Internet Perfect!

111. Stress Test with Zuul

112. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

113. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

114. API Requests from the Internet New Code Getting Prepared for Production

115. Brokering Data to 1,000+ Device Types

116.

117.

118. Screen Real Estate

119. Controller

120. Technical Capabilities

121. One-Size-Fits-All API Request Request Request

122. Courtesy of South Florida Classical Review

123.

124. Resource-Based API vs. Experience-Based API

125. Resource-Based Requests • /users/<id>/ratings/title • /users/<id>/queues • /users/<id>/queues/instant • /users/<id>/recommendations • /catalog/titles/movie • /catalog/titles/series • /catalog/people

126. REST API RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Network Border Network Border

127. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border SERVER CODE CLIENT CODE

128. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border DATA GATHERING, FORMATTING, AND DELIVERY USER INTERFACE RENDERING

129.

130.

131. Experience-Based Requests • /ps3/homescreen

132. JAVA API Network Border Network Border RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Groovy Layer

133.

134. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API SERVER CODE CLIENT CODE CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER) Network Border Network Border

135. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API DATA GATHERING DATA FORMATTING AND DELIVERY USER INTERFACE RENDERING Network Border Network Border

136.

137. https://www.github.com/Netflix

138. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson

Editor's Notes

Netflix strives to be the global streaming video leader for TV shows and movies
We now have more than 44 million global subscribers in more than 40 countries
Those subscribers consume more than a billion hours of streaming video a month which accounts for about 33% of the peak Internet traffic in the US.
Our 44 million Netflix subscribers are watching shows and movies on virtually any device that has a streaming video screen. We are now on more than 1,000 different device types.
The subscribers can watch our original shows like the Emmy-winning House of Cards.
Within this world, the Edge Engineering team focuses on these three aspects of the streaming product.
Before all of this streaming success, however, our roots were in supporting the DVD business.
The system that ran that business was a large monolithic application that was developed and maintained by many teams.
And that monolithic application ran in data centers that we needed to scale and maintain directly.
And as we all know, the bigger the ship, the slower it turns. That was very much the case for Netflix years ago.
To grow to where we knew we needed to be, Netflix aggressively moved to a distributed architecture.
I like to think of this distributed architecture as being shaped like an hourglass…
In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
The API is at the center of the hourglass, acting as a broker of data.
Our distributed architecture, with the number of systems involved, can get quite complicated. Each of these systems talks to a large number of other systems within our architecture.
Assuming each of the 30 dependent services has an SLA of four nines, the compounded availability is about 99.7%, which results in more than two hours of downtime per month.
And that is if all services maintain four nines!
If it degrades as far as to three nines, that is almost one day per month of downtime!
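The arithmetic behind these downtime figures can be sketched as follows. This is a simplified model, assuming (as the slides implicitly do) that every request depends on all 30 services and that their failures are independent; the 30-day month and the function name are illustrative assumptions.

```python
def composite_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs all n independent services to be up."""
    return per_service ** n_services


HOURS_PER_MONTH = 30 * 24          # assumption: a 30-day month
REQUESTS_PER_DAY = 2_000_000_000   # from the slides: 2B API requests per day

for sla in (0.9999, 0.999):
    availability = composite_availability(sla, 30)
    failure_fraction = 1 - availability
    print(
        f"per-service SLA {sla:.2%}: overall {availability:.2%}, "
        f"~{failure_fraction * REQUESTS_PER_DAY / 1e6:.0f}M failed requests/day, "
        f"~{failure_fraction * HOURS_PER_MONTH:.1f} hours of downtime/month"
    )
```

Under these assumptions, four nines per service compounds to roughly 99.7% overall, and three nines compounds to roughly 97%.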
So, back to the hourglass…
In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
Such a failure could have resulted in an outage in the API.
And that outage likely would have cascaded to have some kind of substantive impact on the devices.
The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems and to keep them happy.
To solve this problem, we created Hystrix, a wrapping technology that provides fault tolerance in a distributed environment. Hystrix is also open source and available at our github repository.
To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
This is a view of a single circuit.
This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
The blue line represents the traffic trends over the last two minutes for this dependency.
The green number shows the number of successful calls to this dependency over the last two minutes.
The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
The orange number shows the number of calls that have timed out, resulting in fallback responses.
The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
The red number shows the number of exceptions, resulting in fallback responses.
The error rate is calculated as the total number of error and fallback responses divided by the total number of calls handled.
If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
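The error-rate calculation and the trip decision can be sketched like this. It is not Hystrix's implementation: the class, the field names, and the 50% threshold are hypothetical, and the mapping of the four failure counters (short-circuited, timeout, rejected, exception) onto the numerator is an assumption drawn from the dashboard description.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Call outcomes for one dependency over a rolling window (illustrative)."""
    success: int = 0
    short_circuited: int = 0
    timeout: int = 0
    rejected: int = 0    # thread pool or task queue full
    exception: int = 0

    def error_rate(self) -> float:
        failures = (self.short_circuited + self.timeout
                    + self.rejected + self.exception)
        total = failures + self.success
        return failures / total if total else 0.0


def circuit_is_open(counts: WindowCounts, threshold: float = 0.5) -> bool:
    """Open the circuit (serve fallbacks) while the error rate exceeds the threshold."""
    return counts.error_rate() > threshold
```

Because the counts come from a rolling window, the same check closes the circuit again once the error rate drops back below the threshold.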
The dashboard also shows host and cluster information for the dependency.
As well as information about our SLAs.
So, going back to the engineering diagram…
If that same service fails today…
We simply disconnect from that service.
And replace it with an appropriate fallback. The fallback, ideally, is a slightly degraded but useful offering. If we cannot get that, however, we will quickly provide a 5xx response which will help the systems shed load rather than queue things up (which could eventually cause the system as a whole to tip over).
This will keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
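The fallback behavior described above can be sketched as a small wrapper. The function and exception names are hypothetical, and the example services are stand-ins; the real system wraps each dependency library with Hystrix rather than with code like this.

```python
class ServiceUnavailable(Exception):
    """Stands in for a fast 5xx response when no fallback exists."""


def call_with_fallback(primary, fallback=None):
    """Try the dependency; on failure serve the fallback, or fail fast without one."""
    try:
        return primary()
    except Exception as err:
        if fallback is None:
            # Failing quickly sheds load instead of letting requests queue up.
            raise ServiceUnavailable("dependency failed and no fallback exists") from err
        return fallback()


def personalized_rows():
    # Hypothetical dependency call that is currently failing.
    raise TimeoutError("personalization engine timed out")


def popular_rows():
    # Hypothetical degraded but still useful response.
    return ["Popular on Netflix"]


rows = call_with_fallback(personalized_rows, popular_rows)
```

Here the caller still receives a usable, unpersonalized row instead of an error.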
In addition to the migration to a distributed architecture, we also aggressively moved out of data centers…
And into the cloud.
Instead of spending in data centers, we spend our time in tools such as Asgard, created by Netflix staff, to help us manage our instance types and counts in AWS. Asgard is available in our open source repository at github.
Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
Through autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix).
Instead of reacting to real-time metrics, like load average, to increase/decrease the instance count, we can look at historical patterns in our traffic to figure out what will be needed BEFORE it is needed. We believed we could write algorithms to predict the needs.
This is the result of the algorithms we created for the predictions. The prediction closely matches the actual traffic.
Based on those predictions, we started triggering scaling events. Those scaling events closely matched the traffic patterns as well, although these scaling events (as opposed to the Amazon auto-scaler) preceded the need.
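The idea of scaling ahead of demand can be sketched as follows. Scryer's actual algorithms are not described here, so this uses a naive stand-in (averaging the same hour-of-week slot across past weeks); the function names, the per-instance capacity, and the 20% headroom are all hypothetical.

```python
import math


def predict_rps(history, hour_of_week):
    """Predict requests per second by averaging the same slot across past weeks.

    history is a list of weeks, each a list of 168 hourly RPS values.
    """
    samples = [week[hour_of_week] for week in history]
    return sum(samples) / len(samples)


def planned_minimum(predicted_rps, rps_per_instance, headroom=1.2):
    """Instance minimum to set before the predicted load arrives."""
    return math.ceil(predicted_rps * headroom / rps_per_instance)


# Two weeks of flat traffic, purely for illustration.
history = [[100.0] * 168, [200.0] * 168]
forecast = predict_rps(history, hour_of_week=20)
minimum = planned_minimum(forecast, rps_per_instance=50)
```

A plan like this only raises the instance minimum ahead of time; reactive autoscaling is still needed for traffic the history does not predict.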
Load average when running Scryer is much smoother.
Smoother load average results in more consistent and faster response times.
Another benefit is how Scryer protects us during outage recovery.
Retries and thundering herd effects are mitigated by consistent provisioning of instances through Scryer.
Meanwhile, because we have more predictable scaling needs that can be provisioned more granularly, our AWS costs go down.
Going global has a different set of scaling challenges. AWS enables us to add instances in new regions that are closer to our customers.
To help us manage our traffic across regions, as well as within given regions, we created Zuul. Zuul is open source in our github repository.
Zuul does a variety of things for us. Zuul fronts our entire streaming application as well as a range of other services within our system.
Moreover, Zuul is the routing engine that we use for Isthmus, which is designed to marshal traffic between regions, for failover, performance or other reasons.
We also leverage multiple regions to help us fail over one region to another, through an effort we call Active-Active.
Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
The army is the Simian Army, which is a fleet of monkeys who are designed to do a variety of things, in an automated way, in our cloud implementation. Chaos Monkey, for example, periodically terminates AWS instances in production to see how the system as a whole will respond once that server disappears. Latency Monkey introduces latencies and errors into a system to see how it responds. The system is too complex to know how things will respond in various circumstances, so the monkeys expose that information to us in a variety of ways. The monkeys are also available in our open source github repository.
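The core idea behind Chaos Monkey can be sketched against a simulated fleet. This is not the Simian Army code and makes no AWS calls; the instance ids, function names, and the minimum-healthy check are hypothetical.

```python
import random


def terminate_random_instance(fleet, rng):
    """Remove and return one randomly chosen instance id from the fleet (a set)."""
    victim = rng.choice(sorted(fleet))
    fleet.remove(victim)
    return victim


def fleet_still_serves(fleet, min_healthy):
    """Illustrative check: can the remaining instances carry the load?"""
    return len(fleet) >= min_healthy


fleet = {"i-aaa", "i-bbb", "i-ccc"}
victim = terminate_random_instance(fleet, random.Random())
survived = fleet_still_serves(fleet, min_healthy=2)
```

If a single termination is enough to fail the check, the fleet has a dormant failure that is better found this way than during a real outage.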
Again, the dependency chains in our system are quite complicated.
That is a lot of change in the system!
As a result, our philosophy is to act fast (i.e., get code into production as quickly as possible), then react fast (i.e., respond to issues quickly as they arise).
Two such examples are canary deployments and what we call red/black deployments.
The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
The health analysis of the canary is automated as well, comparing its metrics against the fleet of production servers.
If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc.
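An automated canary check of this kind can be sketched as a comparison against the production baseline. The metric names, the 10% tolerance, and the function name are hypothetical; the real analysis tooling uses a much more comprehensive set of signals.

```python
from statistics import mean


def canary_is_healthy(canary, baseline_fleet, tolerance=0.10):
    """Compare canary metrics (lower is better) with the production fleet average.

    canary is a dict of metric name to value; baseline_fleet is a list of such
    dicts, one per production instance.
    """
    for metric, value in canary.items():
        baseline = mean(instance[metric] for instance in baseline_fleet)
        if value > baseline * (1 + tolerance):
            return False
    return True


production = [
    {"error_rate": 0.01, "latency_ms": 100.0},
    {"error_rate": 0.01, "latency_ms": 110.0},
]
good_canary = {"error_rate": 0.01, "latency_ms": 108.0}
bad_canary = {"error_rate": 0.05, "latency_ms": 100.0}
```

With these numbers the first canary stays within tolerance on both metrics, while the second fails on error rate and would be pulled down.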
We will then repeat the process until the analysis of the canary servers looks good.
We also use Zuul to funnel varying degrees of traffic to the canaries to evaluate how much load the canary can take relative to the current production instances. If the RPS, for example, drops, the canary may fail the Zuul stress test.
If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
Then switch the pointer to have external requests point to the black servers. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
If a problem is encountered from the black servers, it is easy to rollback quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
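The pointer switch and rollback can be sketched as follows. The Router class, the cluster names, and the health callback are hypothetical; in production the "pointer" is the routing of external requests to a server group, not an in-memory object.

```python
class Router:
    """Holds the pointer that decides which cluster receives external requests."""

    def __init__(self, live_cluster):
        self.live = live_cluster
        self.previous = None

    def switch_to(self, new_cluster):
        # Keep the old cluster around so rollback is just a pointer flip.
        self.previous, self.live = self.live, new_cluster

    def roll_back(self):
        if self.previous is None:
            raise RuntimeError("no previous cluster to roll back to")
        self.live, self.previous = self.previous, None


def deploy(router, new_cluster, healthy_under_full_load):
    """Send traffic to the new cluster; flip back if it misbehaves under load."""
    router.switch_to(new_cluster)
    if not healthy_under_full_load(new_cluster):
        router.roll_back()
        return False
    return True


router = Router("red-v1")
first_attempt = deploy(router, "black-v2", lambda cluster: False)
after_rollback = router.live
second_attempt = deploy(router, "black-v3", lambda cluster: True)
```

Keeping the red servers running until the black cluster has proven itself under full load is what makes the rollback fast.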
Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
And we will stress the canary again…
If the new code looks good in the canary, we can then bring up another set of servers with the new code.
Then we will switch production traffic to the new code.
If everything still looks good, we disable the red servers and the new code becomes the new red servers.
Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis?
Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
Many UI teams needing metadata means many requests to the API team. In the one-size-fits-all API world, we essentially needed to funnel these requests and then prioritize them. That means that some teams would need to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints, resulting in a more complex system as well as a lot of spaghetti code. Making teams wait due to prioritization was exacerbated by the fact that tasks took longer because the technical debt was increasing, causing time to build and test to increase. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
Many other companies have seen similar issues and have introduced orchestration layers that enable more flexible interaction models.
OData, HYQL, ql.io, rest.li and others are examples of orchestration layers. They address the same problems that we have seen, but we have approached the solution in a very different way.
We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
In this mode, there is a very clear divide between the Client Code and the Server Code. That divide is the network border.
And the responsibilities have the same distribution as well. The Client Code handles the rendering of the interface (as well as asking the server for data). The Server Code is responsible for gathering, formatting and delivering the data to the UIs.
And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
In an experience-based interaction, the PS3 can potentially make a single request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script, which then formats and delivers the very specific data back to the PS3.
We also introduced RxJava into this layer to improve our ability to handle concurrency and callbacks. RxJava is open source in our github repository.
In this model, the border between Client Code and Server Code is no longer the network border. It is now back on the server. The Groovy is essentially a client adapter written by the client teams.
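The shape of such a client adapter can be sketched as follows. The real adapters are Groovy scripts calling the Java API in the same JVM; this Python sketch only illustrates the division of work, and every function name, field, and data value in it is hypothetical.

```python
# Hypothetical stand-ins for the data-gathering methods owned by the API team.
def get_user(user_id):
    return {"id": user_id, "name": "Sam", "plan": "streaming"}


def get_recommendations(user_id):
    return [
        {"title_id": 1, "name": "House of Cards", "synopsis": "long text", "cast": ["a", "b"]},
        {"title_id": 2, "name": "Title Two", "synopsis": "long text", "cast": ["c"]},
        {"title_id": 3, "name": "Title Three", "synopsis": "long text", "cast": ["d"]},
    ]


def get_queue(user_id):
    return [{"title_id": 4, "name": "Title Four", "synopsis": "long text", "cast": []}]


def ps3_homescreen(user_id, max_titles=2):
    """Client adapter owned by the device team: one request in, one tailored payload out."""
    recommendations = get_recommendations(user_id)[:max_titles]
    queue = get_queue(user_id)[:max_titles]
    return {
        "user": get_user(user_id)["name"],
        "rows": [
            {"label": "Recommended", "titles": [t["name"] for t in recommendations]},
            {"label": "Queue", "titles": [t["name"] for t in queue]},
        ],
    }


payload = ps3_homescreen(user_id=42)
```

The device makes one call, and the adapter decides how many titles and which fields this particular screen needs, so unused metadata never crosses the network border.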
And the distribution of work changes as well. The client teams continue to handle UI rendering, but now are also responsible for the formatting and delivery of content. The API team, in terms of the data side of things, is responsible for the data gathering and hand-off to the client adapters. Of course, the API team does many other things, including resiliency, scaling, dependency interactions, etc. This model is essentially a platform for API development.
If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
All of the open source components discussed here, as well as many others, can be found at the Netflix github repository.
