This document discusses Netflix's approach to developing and deploying its API in a way that allows it to move fast while staying safe. It focuses on how Netflix uses automation, architecture, and insight to rapidly innovate and scale its API to support over 50 million subscribers across over 40 countries and over 1000 device types. Key aspects include automated testing, red/black deployments, predictive autoscaling, real-time metrics and debugging to enable continuous delivery while maintaining high availability, resiliency and rollback capabilities.
12. Role of API
• Enable rapid innovation
• Conduit for metadata between Devices and Services
• Implements business logic
• Scale with business
• Maintain resiliency
http://goo.gl/VhokZV
80. Move Fast; Stay Safe
Developing and Deploying the Netflix API
Sangeeta Narayanan
@sangeetan
http://www.linkedin.com/in/sangeetanarayanan
Editor's Notes
Started out as a DVD rental by mail service
Introduced on-demand video streaming over the internet in 2007
Has since expanded internationally
2012 marked a foray into the world of original programming
Shows like House of Cards and Orange Is the New Black have been received with high acclaim, as evidenced by recent Emmy wins. The strategy is to expand internationally and pursue high-quality content to drive engagement and acquisition.
Global expansion, high quality originals and personalized content have fueled rapid subscriber growth.
Netflix now accounts for over one third of downstream internet traffic in North America at peak. This number has been in the news a lot lately!
Our members can choose to enjoy our service on over 1000 device types.
Edge Engineering operates the services that provide the personalized discovery and streaming experience for our members.
This is an extremely high-level view of the Netflix service. The API is the internet-facing service that all devices connect to in order to provide the user experience. The API in turn consumes data from several middle-tier services, applies business logic on top of it as needed, and provides an abstraction layer for devices to interact with.
The API, in effect, acts as a broker of metadata between services and devices. Put another way, almost all product functionality flows through the API.
We are constantly striving for a balance between velocity and availability.
This talk will cover some of the strategies and techniques we employ in our pursuit of the balance between velocity and availability. I will focus on three areas: Architecture, Automation and Insight.
Let’s look at a couple of examples of architectural choices that enable velocity and resiliency.
This is an overview of the Netflix Streaming architecture.
Zooming in on the interaction between the API and the devices it serves.
We support over 1000 device types.
Embracing the Differences: http://techblog.netflix.com/2012/07/embracing-differences-inside-netflix.html
Inside the API container
The Dynamic Scripting Platform reduces chattiness and allows API clients to develop and operate endpoints customized to their apps, on top of the API platform. Feature development and operations are distributed in this model, with endpoint dev and ops decoupled from those of the API (assuming the requisite functionality is available in the API).
Move away from a resource-based API to an experience-based API.
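To illustrate the difference, here is a minimal sketch of an experience-based endpoint. The three `fetch_*` functions and the `tv_home_screen` endpoint are hypothetical stand-ins, not part of the Netflix API; in a resource-based design the device would make each of those calls itself over the network.

```python
# Hypothetical middle-tier lookups; real ones would be remote service calls.
def fetch_profile(user_id):
    return {"id": user_id, "name": "member-%d" % user_id}


def fetch_recommendations(user_id):
    return [{"title_id": 101}, {"title_id": 202}]


def fetch_title(title_id):
    return {"id": title_id, "box_art": "art-%d.jpg" % title_id}


def tv_home_screen(user_id):
    """One device-specific endpoint that replaces several resource calls."""
    profile = fetch_profile(user_id)
    titles = [fetch_title(r["title_id"]) for r in fetch_recommendations(user_id)]
    # Return only the fields this (hypothetical) TV UI actually renders.
    return {
        "greeting": profile["name"],
        "box_art": [t["box_art"] for t in titles],
    }
```

The device makes a single request and receives a payload shaped for its own screen, which is where the reduction in chattiness comes from.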
Device teams are able to operate and manage their endpoints independently. This screenshot from our dashboard is showing the activity on various endpoints across all API environments.
API Server stats
Going back to the internals of the API container
Hystrix provides fault tolerance and resiliency by implementing the circuit breaker and bulkheading patterns to protect the API from failures in upstream dependencies.
http://techblog.netflix.com/2012/11/hystrix.html
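The circuit breaker pattern can be sketched in a few lines. This is an illustration of the pattern only, not Hystrix's API: the `CircuitBreaker` class, its parameters and the threshold values are assumptions made for the example, and bulkheading (isolating each dependency's resources) is not shown.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, command, fallback):
        # While open, short-circuit to the fallback until the cool-down elapses.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each dependency call, for example `breaker.call(lambda: fetch_ratings(user), lambda: [])`, where `fetch_ratings` is a hypothetical dependency. A failing dependency then degrades to a fallback response instead of tying up the API.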
Global AWS deployment in 3 EC2 regions. Each region has 3 availability zones.
Each region runs a ‘cluster’ of EC2 instances, consisting of one or more ASGs (Auto Scaling Groups). Instances are ephemeral, i.e. they come and go. Software is written to handle the loss of instances.
Eureka maintains a registry of healthy instances for each application and a software load balancer is used to route traffic within the SOA.
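A rough sketch of the idea, assuming a heartbeat-based registry and a round-robin client-side balancer. The `Registry` and `RoundRobinBalancer` classes and the 30-second lease are illustrative choices, not Eureka's actual API or defaults.

```python
import itertools
import time


class Registry:
    """Tracks instances per application by their last heartbeat."""

    def __init__(self, lease_seconds=30.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # arbitrary value for the sketch
        self.clock = clock
        self._heartbeats = {}  # (app, instance_id) -> last heartbeat time

    def heartbeat(self, app, instance_id):
        self._heartbeats[(app, instance_id)] = self.clock()

    def healthy_instances(self, app):
        # An instance counts as healthy while its lease has not expired.
        now = self.clock()
        return sorted(
            inst
            for (a, inst), seen in self._heartbeats.items()
            if a == app and now - seen <= self.lease_seconds
        )


class RoundRobinBalancer:
    """Client-side balancer that rotates over the healthy instances."""

    def __init__(self, registry):
        self.registry = registry
        self._counter = itertools.count()

    def choose(self, app):
        instances = self.registry.healthy_instances(app)
        if not instances:
            raise LookupError("no healthy instances for %s" % app)
        return instances[next(self._counter) % len(instances)]
```

Because instances that stop sending heartbeats drop out of `healthy_instances`, callers stop routing to them without any manual intervention, which suits ephemeral instances.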
If we lose an AZ, instances are allocated across the remaining AZs. In the event of a region outage, traffic fails over to the other region.
The Simian Army simulates various outage scenarios that help us validate that our systems handle failures gracefully, as designed. They also serve as practice drills for our teams.
Our traffic pattern shows an ebb and flow based on time of day and day of week. We use Amazon’s autoscaling policies to adjust capacity dynamically. This is pretty effective, but we ran into some of its limitations. An example is its inability to handle a traffic surge after an outage.
To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix). Scryer evaluates capacity needs based on historical data (week over week, month over month metrics), adjusts instance minimums algorithmically, and relies on Amazon Auto Scaling for unpredicted events.
This graph shows that Scryer’s predictions are in line with actual RPS. In production, Scryer allows us to bring instances into service before they are needed. This differs from Amazon’s reactive autoscaling engine, which triggers the ramp-up based on immediate need and then has to wait for server start-up to complete. Because the instances are there in advance, Scryer smooths out load averages and response times, which in turn improves the customer experience.
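Scryer's algorithms are not described here, so the following is only a sketch of the general idea of predictive scaling: average the RPS seen in the same time slot of previous weeks, then derive an instance minimum with some headroom. The function names, the plain average and the 25% headroom are all assumptions for illustration.

```python
import math


def predict_rps(history, slot):
    """Average the RPS observed in the same time slot of previous weeks."""
    samples = [week[slot] for week in history]
    return sum(samples) / len(samples)


def instance_minimum(predicted_rps, rps_per_instance, headroom=0.25):
    """Instances needed to serve the prediction with spare headroom."""
    return math.ceil(predicted_rps * (1 + headroom) / rps_per_instance)


# Two previous weeks, two time slots each (off-peak, peak); made-up numbers.
history = [[1000, 4000], [1200, 4400]]
peak = predict_rps(history, slot=1)
minimum = instance_minimum(peak, rps_per_instance=500)
```

The resulting minimum would be applied to the Auto Scaling Group ahead of the predicted peak, leaving reactive autoscaling to cover anything the prediction misses.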
We want to move fast, but protect ourselves from the dangers of doing so. Automation increases velocity while reducing risk by removing the potential for human error. It also helps to bring consistency and predictability to operations.
Shift the curve so you can go faster without compromising availability
It’s about staying on the edge, but with safety guards in place.
We have implemented Continuous Delivery to deal with the need for velocity. Releasing software in a steady stream allows us to go faster, bring predictability to our releases and minimize the risks associated with introducing change.
This is a view of our delivery pipeline. We deploy to internal environments several times a day. Production deployments are less frequent because of our farm sizes and the Red/Black deployment model we follow (details in later slides), but we have the ability to deploy on demand in an automated fashion.
We follow the ‘Operate what you Build’ model where developers are responsible for shepherding their changes all the way through to production. We provide them with the tools necessary to help them gain confidence in the quality of their code. One such tool is the automated Canary Analyzer.
Canary reports are generated at periodic intervals and emailed to the team. They are also available from the dashboard. This canary report shows an overall confidence score for the readiness of that build. This one didn’t do very well.
Details of the problematic metrics that contributed to the poor canary score.
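The scoring method used by the Canary Analyzer is not spelled out here. As a simplified sketch, assume every metric is "lower is better" and that a metric passes when the canary stays within a tolerance of the baseline; the score is then the percentage of passing metrics. The function name, the 10% tolerance and the metric names are illustrative assumptions.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Return (score, failing): score is the % of metrics within tolerance.

    All metrics are treated as "lower is better" (error rate, latency, ...).
    """
    failing = []
    for name, base_value in baseline.items():
        limit = base_value * (1 + tolerance)
        if canary[name] > limit:
            failing.append(name)
    score = 100.0 * (len(baseline) - len(failing)) / len(baseline)
    return score, sorted(failing)


baseline = {"error_rate": 0.01, "latency_p99_ms": 200.0, "cpu_pct": 40.0, "gc_pause_ms": 20.0}
canary = {"error_rate": 0.05, "latency_p99_ms": 205.0, "cpu_pct": 41.0, "gc_pause_ms": 35.0}
score, failing = canary_score(baseline, canary)
```

Returning the failing metric names alongside the score mirrors the two views in the report: an overall confidence number, plus the metrics that dragged it down.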
We have a complex web of dependencies. Some problems cannot be caught until we are in Production.
We mitigate that by running a separate dependency-update pipeline. This allows us to validate the latest set of dependencies independent of our own code. This validation goes through all the steps of the normal pipeline, including the canary process. We also have detailed insight into the changes that went into each canary, including library and config changes.
The same pipeline is also available to developers for their feature branches so they can test their code in production in isolation.
Ready for deployment
In the event that a newly deployed version of the software proves to be problematic, the system can be rolled back to the previous version. The old cluster is kept alive for a few hours so the automation knows what to roll back to. Because of our extensive use of autoscaling, provisioning the clusters accurately is tricky, and having to do it manually across three regions would make rollbacks slow and leave them prone to error. Even though rollbacks are rare, the cost of getting it wrong is too high.
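The bookkeeping behind such a rollback can be sketched as follows. This is an assumption-laden illustration, not the actual deployment automation: `RedBlackDeployer` is a hypothetical class, clusters are represented by plain names, and the three-hour retention window is an arbitrary stand-in for "a few hours".

```python
import time


class RedBlackDeployer:
    """Tracks the active cluster and the previous one kept for rollback."""

    def __init__(self, retention_seconds=3 * 3600, clock=time.monotonic):
        self.retention_seconds = retention_seconds
        self.clock = clock
        self.active = None      # cluster currently taking traffic
        self.previous = None    # old cluster kept alive for rollback
        self.switched_at = None

    def deploy(self, new_cluster):
        # Traffic moves to the new cluster; the old one stays up, idle.
        self.previous, self.active = self.active, new_cluster
        self.switched_at = self.clock()

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no previous cluster to roll back to")
        if self.clock() - self.switched_at > self.retention_seconds:
            raise RuntimeError("previous cluster already retired")
        # Shift traffic back to the cluster that was serving before.
        self.active, self.previous = self.previous, None
        return self.active
```

Because the previous cluster is still running and already sized, rolling back is a traffic switch rather than a fresh provisioning exercise.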
Dynamic configuration using Archaius allows features to be toggled dynamically. If a newly introduced feature proves to be problematic, turning it off is an easy way to restore system health. Archaius is a set of configuration management APIs based on the Apache Commons Configuration library. This allows configuration changes to be propagated in a matter of minutes, at runtime, without requiring app downtime. Configuration properties are multi-dimensional and context aware, so their scope can be limited to a specific context, e.g. env = Test/Staging/Production or region = us-east/us-west/eu-west etc.
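Context-aware properties can be pictured as a set of scoped rules where the most specific matching rule wins. The sketch below is not Archaius's API: `ScopedProperties`, its methods and the `feature.newRow.enabled` property name are hypothetical, chosen only to show the resolution logic.

```python
class ScopedProperties:
    """Resolves a property to the most specific rule matching the context."""

    def __init__(self):
        self._rules = {}  # name -> list of (scope dict, value)

    def set(self, name, value, **scope):
        self._rules.setdefault(name, []).append((scope, value))

    def get(self, name, default=None, **context):
        best = None
        for scope, value in self._rules.get(name, []):
            # A rule matches when every dimension it names agrees with the context.
            if all(context.get(k) == v for k, v in scope.items()):
                # Prefer more dimensions; on a tie, the later rule wins.
                if best is None or len(scope) >= len(best[0]):
                    best = (scope, value)
        return default if best is None else best[1]


props = ScopedProperties()
props.set("feature.newRow.enabled", False)                        # global default
props.set("feature.newRow.enabled", True, env="test")             # on in Test
props.set("feature.newRow.enabled", True, env="prod", region="us-east")
```

Turning a problematic feature off in one region then amounts to publishing one more scoped rule at runtime, for example `props.set("feature.newRow.enabled", False, env="prod", region="us-east")`, without redeploying.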
Top: Notification of scheduled deployment emailed to the team.
Bottom: chatbot provides realtime updates
http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html
Realtime dashboard powered by Turbine and Hystrix
We can see an outage in real time - the no. of 5XX errors & latency spiked during the incident. This data is being streamed by hundreds of servers, aggregated using Turbine and streamed to the dashboard.
As service owners, we are responsible for defining and configuring our own alerts, and for responding to them at 4am too!
We need to be mindful of the number of metrics we are publishing so we don’t inundate the monitoring systems. That is part of the canary analysis as well.
Our big data pipeline (based on Kafka, Druid and Suro) powers this console that allows for real-time debugging and request tracing. http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html
All changes in production are recorded by publishing to a system and can be used for auditing and correlation to production events.
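As a sketch of how recorded changes help with correlation, assume each change is published as a timestamped event and that, during an incident, we ask for everything that changed shortly beforehand. The `ChangeEvent` and `ChangeLog` names and the in-memory storage are hypothetical; the system used in production is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    timestamp: float  # seconds since some epoch
    kind: str         # e.g. "deployment", "config", "scaling"
    detail: str


class ChangeLog:
    """In-memory stand-in for the system that changes are published to."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        self._events.append(event)

    def changes_before(self, incident_time, window_seconds):
        """Changes in the window leading up to an incident, newest first."""
        start = incident_time - window_seconds
        return sorted(
            (e for e in self._events if start <= e.timestamp <= incident_time),
            key=lambda e: e.timestamp,
            reverse=True,
        )
```

Listing the newest changes first puts the most likely suspects at the top when an alert fires.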
Good architectural practices, automation & tooling and deep insight into our systems allow us to operate resilient systems and go fast at scale. But the key piece that brings it all together and completes the picture is our culture.
Employees have the freedom to make major decisions and act on them without approvals. The counterbalance is the responsibility they assume for the implications of their actions. Management’s job is to set the appropriate context so employees have all the information they need to make the right decisions and judgement calls. This fosters a blameless culture where people feel empowered to take risks.