This presentation was given to the engineering organization at Zendesk. In it, I talk about the challenges the Netflix API faces in supporting more than 1,000 different device types, millions of users, and billions of transactions. Topics include resiliency, scale, API design, failure injection, continuous delivery, and more.
Application Modernization: Migrating Mainframe Apps to the Cloud Using Spring, by VMware Tanzu
SpringOne 2021:
Session Title: Application Modernization: Migrating Mainframe Apps to the Cloud Using Spring
Speaker: Glenn Renfro, Spring Developer at VMware
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi..., by Databricks
Salesforce recently invented and deployed a real-time, scalable, low-false-positive personalized anomaly detection system operating at terabyte scale. Anomaly detection on user in-app behavior at that scale is extremely challenging because traditional techniques such as clustering suffer serious performance issues in production.
Salesforce’s method tackles the traditional challenges through three phases: 1) Leveraging Principal Component Analysis (PCA) to extract high-variance and low-variance feature subsets. The low-variance feature subset is valuable in cybersecurity because the goal is to determine whether a user deviates from his or her stable behavior; the high-variance one is used for dimension reduction. 2) On each feature subset, they build a profile for each user to characterize the user’s baseline behavior and legitimate abnormal behavior. 3) During detection, each incoming event is compared with the user’s profile to produce an anomaly score. The computational complexity of the detection step for each incoming event is constant.
[…]cloud computing platforms; the novelty of our user-behavior-profiling-based anomaly detection technique; and the challenges of implementing and deploying it with Apache Spark in production. We will also demonstrate how our system outperforms other traditional machine learning algorithms.
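The three-phase profiling approach described above can be sketched in plain NumPy on synthetic data. All names, the variance threshold, and the max-distance scoring rule are illustrative assumptions, not Salesforce's actual implementation:

```python
import numpy as np

def fit_profile(X, var_threshold=0.9):
    """Phases 1 and 2: split features into high-/low-variance PCA subspaces
    and record a per-user baseline in each. X is (n_events, n_features)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    explained = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(explained, var_threshold)) + 1
    high, low = Vt[:k].T, Vt[k:].T        # high-/low-variance bases
    baseline = {"high": np.linalg.norm(Xc @ high, axis=1).max(),
                "low": np.linalg.norm(Xc @ low, axis=1).max()}
    return mu, high, low, baseline

def anomaly_score(event, mu, high, low, baseline, eps=1e-9):
    """Phase 3: constant-time score for one incoming event -- its distance
    from the profile in each subspace, normalized by the baseline."""
    d = event - mu
    s_high = np.linalg.norm(d @ high) / (baseline["high"] + eps)
    s_low = np.linalg.norm(d @ low) / (baseline["low"] + eps)
    return max(s_high, s_low)             # > 1.0: outside observed behavior

# Synthetic "normal" behavior for one user: 8 features with varying scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)) * np.array([5, 4, 3, 2, 1, 0.5, 0.2, 0.1])
mu, high, low, base = fit_profile(X)
normal = anomaly_score(X[0], mu, high, low, base)       # within baseline
odd = anomaly_score(X[0] + 50, mu, high, low, base)     # far outside it
```

Because the bases and baselines are precomputed per user, scoring one event is just two small matrix-vector products, which is what makes per-event detection cost constant.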
Micro Focus Software Delivery and Testing: Jan De Coster's presentation on the journey to DevOps at the recent Micro Focus #DevDay Copenhagen.
Micro Focus enables enterprise software organizations to build innovative software and accelerate application delivery to meet the needs of the business. Whatever the challenges and infrastructures, our core principle—of reusing what already works to minimize business risk while supporting modern software practices—has positioned our customers to be better prepared to support the digital transformation of the business.
Build, test and deliver innovative software faster with less risk.
April 2017.
Explore the components of a simple web app, see how Applications Manager helps you discover and map relationships between your apps, and take a look at the wide range of business-critical metrics monitored for every app/server in your network. We will also explore how to logically group your apps so as to achieve maximum productivity from your IT department.
Enabling self-service automation with ServiceNow and Ansible Automation Platform, by Michael Ford
Red Hat Ansible Automation Platform allows end users in the enterprise to deploy on-demand workloads and perform common automated tasks in a safe, scalable manner. Additionally, Python/PowerShell module support and a RESTful API allow the same users to interact with a familiar interface while launching automated Ansible tasks in the background. One of the most ubiquitous self-service platforms in use today is ServiceNow, and many conversations with Ansible Automation Platform customers focus on ServiceNow integration.
In this session, we’ll showcase the ServiceNow and Ansible Automation Platform integration that provides self-service IT scaling for everyone. You’ll see how ServiceNow can be used to kick off a complex cloud deployment with Ansible, how Ansible will manage the life cycle of ServiceNow artifacts, and how Ansible can use the ServiceNow Inventory plug-in to manage Day 2 operations in your cloud environment.
Kappa vs Lambda Architectures and Technology Comparison, by Kai Wähner
Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers.
This video explores why a single real-time pipeline, called Kappa architecture, is the better fit for many enterprise architectures. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for a Lambda architecture.
The main focus of the discussion is on Apache Kafka (and its ecosystem) as the de facto standard for event streaming to process data in motion (the key concept of Kappa), but the video also compares various technologies and vendors such as Confluent, Cloudera, IBM, Red Hat, Apache Flink, Apache Pulsar, AWS Kinesis, Amazon MSK, Azure Event Hubs, Google Pub/Sub, and more.
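The single-pipeline idea at the heart of Kappa can be sketched with a toy in-memory log standing in for a Kafka topic. The class, event shapes, and aggregation below are illustrative assumptions, not code from the talk:

```python
from collections import defaultdict

class Log:
    """Toy stand-in for a Kafka topic partition: an append-only log."""
    def __init__(self):
        self.events = []
    def append(self, event):
        self.events.append(event)
    def replay(self, from_offset=0):
        yield from self.events[from_offset:]

def build_view(log, from_offset=0):
    """One pipeline serves both roles: tail the log for real-time updates,
    or replay it from offset 0 to rebuild a view -- no separate batch layer."""
    totals = defaultdict(float)
    for event in log.replay(from_offset):
        totals[event["user"]] += event["amount"]
    return dict(totals)

log = Log()
for e in ({"user": "a", "amount": 3.0},
          {"user": "b", "amount": 1.5},
          {"user": "a", "amount": 2.0}):
    log.append(e)

view = build_view(log)            # {'a': 5.0, 'b': 1.5}
# Changed the processing logic? Deploy the new code and replay from offset 0:
# that replay is the Kappa answer to Lambda's separate batch layer.
rebuilt = build_view(log, from_offset=0)
```

In a real deployment the log would be a long-retention (or compacted) Kafka topic and the view a stream-processing job, but the design choice is the same: reprocessing is a replay of the one pipeline, not a second codebase.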
Video recording of this presentation:
https://youtu.be/j7D29eyysDw
Further reading:
https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/
https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/
Bring the VMware Software-Defined Data Center to Amazon Web Services with VMware Cloud. In this webinar we will dive into the compute, network and storage architecture of the VMware Cloud on AWS solution. We will look at real-world, live applications running in VMware Cloud on AWS which integrate with native AWS services such as S3 and Amazon Relational Database Service. We’ll discuss common deployment scenarios including Hybrid Cloud Architectures and Disaster Recovery and explore how the TCO of these implementations differs in VMware Cloud as compared to on-premises implementations.
Learn how Site24x7 gives you end-to-end application performance visibility for your Java, .NET and Ruby web transactions with metrics of all components starting from URLs to SQL queries.
MicroServices at Netflix - challenges of scale, by Sudhir Tonse
Microservices have caught on as the design pattern of choice for many companies at scale. While microservices and SOA in general have many positives compared to monolithic apps, they come with their own challenges, especially when running at scale. These slides were for a 15-minute Meetup talk hosted at Cisco.
Failure is not an Option - Designing Highly Resilient AWS Systems, by Amazon Web Services
Customers moving mission-critical applications to the cloud are seeking guidance to replicate and improve the resiliency of their Tier-1 systems, while simultaneously meeting compliance and regulatory requirements. Natural disasters, internet disruptions, hardware or software failure can lead to events requiring customers to invoke disaster recovery (DR) plans. Join us in this session to learn how to “design for failure” and remain resilient in the event of disaster by designing applications using highly resilient components and service features.
Most data visualisation solutions today still work on data sources that are stored persistently in a data store, using so-called "data at rest" paradigms. More and more data sources today provide a constant stream of data, from IoT devices to social media streams. These data streams publish with high velocity, and messages often have to be processed as quickly as possible. For the processing and analytics of this data, so-called stream processing solutions are available, but these provide minimal or no visualisation capabilities. One way is to first persist the data into a data store and then use a traditional data visualisation solution to present it.
If latency is not an issue, such a solution might be good enough. Another question is which data store is necessary to keep up with the high load on write and read. If it is not an RDBMS but a NoSQL database, then not all traditional visualisation tools may integrate with that specific data store. Another option is to use a streaming visualisation solution; these are built specifically for streaming data and often do not support batch data. A much better solution would be one tool capable of handling both batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
This slide deck explores the challenges of securing microservices, best practices to overcome them, and how WSO2 Identity Server can be used in microservice architecture.
Watch webinar recording here: https://wso2.com/library/webinars/2018/09/the-role-of-iam-in-microservices/
Performance Engineering Masterclass: Efficient Automation with the Help of SR..., by ScyllaDB
Henrik Rexed from Dynatrace walks through how to measure, validate and visualize these SLOs using Prometheus, an open observability platform, to provide concrete examples. Next, you learn how to automate your deployment using Keptn, a cloud-native event-based life-cycle orchestration framework. Discover how it can be used for multi-stage delivery, remediation scenarios, and automating production tasks.
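As a back-of-the-envelope companion to that kind of SLO validation, the error-budget arithmetic can be sketched in a few lines. The helper names and numbers below are illustrative, not from the session:

```python
def error_budget_seconds(slo, window_seconds):
    """Downtime allowed by an availability SLO over a window."""
    return (1.0 - slo) * window_seconds

def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request error budget left -- the kind of ratio a
    Prometheus recording rule would compute from request counters."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures

# A 99.9% SLO over 30 days allows about 43 minutes of downtime:
budget_30d = error_budget_seconds(0.999, 30 * 24 * 3600)   # 2592.0 seconds
# 500 failures out of 1M requests against 99.9%: half the budget spent.
remaining = budget_remaining(0.999, 1_000_000, 500)        # 0.5
```

Gating a deployment then reduces to a comparison such as `remaining > 0`, which is the sort of automated quality-gate decision a tool like Keptn evaluates between stages.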
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng..., by The Linux Foundation
Safety certification is one of the essential requirements for software used in highly regulated industries. Besides technical and compliance issues (such as ISO 26262 vs. IEC 61508), transitioning an existing project to become more easily safety-certifiable requires significant changes to development practices within an open source project.
In this session, we will lay out some challenges of making safety certification achievable in open source and the Xen Project. We will outline the process the Xen Project has followed thus far and highlight lessons learned along the way. The talk will focus primarily on the necessary process and tooling changes and on community challenges that can prevent progress. We will offer an in-depth review of how the Xen Project is approaching this challenging goal and try to derive lessons for other projects and contributors.
Building an Open Edge Software Environment Using AWS Greengrass V2 and New IoT Services, by Sehyeon Lee, AWS IoT Specialist..., Amazon Web Services Korea
AWS IoT Greengrass V2, a new version of AWS IoT Greengrass, has been released to extend AWS cloud services and capabilities to the edge.
This session introduces the improved architecture and features of the newly released AWS IoT Greengrass V2, and shows how to develop edge software with agility and flexibility by leveraging the advantages of an open edge platform. It also covers the various new IoT services announced at AWS re:Invent 2020.
Aruna Ravichandran, VP of APM at CA Technologies, walks you through what Application Performance Management really is and discusses some of the major benefits of using the software.
Learn more about APM solutions from CA Technologies at http://www.ca.com/apm
Performance Engineering Masterclass: Introduction to Modern Performance, by ScyllaDB
Leandro Melendez from Grafana k6 starts off by providing a grounding in current expectations of what performance engineering and load testing entail. This session defines the modern challenges developers face, including continuous performance principles, Service Level Objectives (SLOs), and Service Level Indicators (SLIs). It delineates best practices and provides hands-on examples using Grafana k6, an open source modern load testing tool.
The Top 5 Apache Kafka Use Cases and Architectures in 2022, by Kai Wähner
I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:
Kappa Architecture: Kappa goes mainstream to replace Lambda and Batch pipelines (that does not mean that there is no batch processing anymore). Examples: Kafka-powered Kappa architectures from Uber, Disney, Shopify, and Twitter.
Hyper-personalized Omnichannel: Retail and customer communication across online and offline channels becomes the new black, including context-specific upselling, recommendations, and location-based services. Examples: Omnichannel Retail and Customer 360 in Real-Time with Apache Kafka.
Multi-Cloud Deployments: Business units and IT infrastructures span across regions, continents, and cloud providers. Linking clusters for bi-directional replication of data in real-time becomes crucial for many business models. Examples: Global Kafka deployments.
Edge Analytics: Low latency requirements, cost efficiency, or security requirements enforce the deployment of (some) event streaming use cases at the far edge (i.e., outside a data center), for instance, for predictive maintenance and quality assurance on the shop floor level in smart factories. Examples: Edge analytics with Kafka.
Real-time Cybersecurity: Situational awareness and threat intelligence need to process massive data in real-time to defend against cyberattacks successfully. The many successful ransomware attacks across the globe in 2021 were a warning for most CIOs. Examples: Cybersecurity for situational awareness and threat intelligence in real-time.
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium..., by Michael Allen
New cloud stacks, containers, microservices, automation, and DevOps are driving an explosion of application code and infrastructure complexity. It is now nearly impossible to solve the Digital Application Performance Management challenges with traditional tools and approaches. Hear how we are delivering on our vision for digital performance management, and how the role of digital virtual assistants might extend into your enterprise. Meet D.A.V.I.S.
Cloud Native Engineering with SRE and GitOps, by Weaveworks
Site reliability engineering (SRE), a model championed by Google, is a software engineering approach to IT operations. For companies striving to become cloud native and adopting modern tools such as Kubernetes, SRE best practices are crucial for success.
In this webinar, Brice, one of our seasoned Customer Reliability Engineers, will show how to design a fail-proof Kubernetes platform using tried-and-tested SRE and GitOps methods.
He will share best practices on:
Increasing performance and ensuring scalability
Managing incident responses through disaster recovery
Designing for High Availability in Kubernetes
Achieving 360 visibility and alerts for your platform
Revolutions have a common pattern in technology and this is no different for the API space. This presentation discusses that pattern and goes through various API revolutions. It also uses Netflix as an example of how some revolutions evolved and where things may be headed.
Netflix Edge Engineering Open House Presentations - June 9, 2016, by Daniel Jacobson
Netflix's Edge Engineering team is responsible for handling all device traffic to support the user experience, including sign-up, discovery, and the triggering of the playback experience. Developing and maintaining this set of massive-scale services is no small task, and its success is the difference between millions of happy streamers and millions of missed opportunities.
This video captures the presentations delivered at the first ever Edge Engineering Open House at Netflix. This video covers the primary aspects of our charter, including the evolution of our API and Playback services as well as building a robust developer experience for the internal consumers of our APIs.
Maintaining the Netflix Front Door - Presentation at Intuit Meetup, by Daniel Jacobson
This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. It also shares several of our open source efforts, such as Zuul, Scryer, Hystrix, RxJava, and the Simian Army.
Most API providers focus on solving all three of the key challenges for APIs: data gathering, data formatting, and data delivery. All three of these functions are critical to the success of an API; however, not all of them should be solved by the API provider. Rather, the API consumers have a strong, vested interest in the formatting and delivery. As a result, API design should be addressed based on the true separation of concerns between the needs of the API provider and those of the various API consumers.
This presentation goes into the separation of concerns. It also goes into depth in how Netflix has solved for this problem through a very different approach to API design.
This presentation was given at the following API Meetup in SF:
http://www.meetup.com/API-Meetup/events/171255242/
Set Your Content Free! : Case Studies from Netflix and NPRDaniel Jacobson
Last Friday (February 8th), I spoke at the Intelligent Content Conference 2013. When Scott Abel (aka The Content Wrangler) first contacted me to speak at the event, he asked me to speak about my content management and distribution experiences from both NPR and Netflix. The two experiences seemed to him to be an interesting blend for the conference. These are the slides from that presentation.
I have applied comments to every slide in this presentation to include the context that I otherwise provided verbally during the talk.
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
Netflix’s unique DVD rental service has revolutionized the industry. They successfully took the best of traditional conventions (like physical media and the U.S. Postal Service) and mixed them with new-world internet conventions. They have also effectively managed to discourage competition from both more established businesses and new entrants. The future growth of Netflix as it expands into streaming media poses challenges in legal, infrastructure/technology, and additional costs. In order to remain competitive, it is imperative that Netflix partner with companies with global reach to overcome these challenges. This presentation was part of an MBA class assignment to audit an industry in the technology sector. The presentation has multiple authors listed on the title page. If you would like copies of the executive summary, complete S.W.O.T. analysis, and/or the transcript of the presentation please PRIVATE MESSAGE ME and I will email it to you.
Slides for the AllianzX Cloud Days in Munich, 01/03, presenting API Management as a complementary component of a modern service architecture. Rather than describing an API Management platform in detail, the talk shows the evolution path of the architecture toward open APIs and a cloud service fabric.
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
APIs for Internal Audiences - Netflix - App Dev ConferenceDaniel Jacobson
API programs, typically thought of as public programs to see what public developer communities can build with a company's data, are becoming more and more critical to the success of mobile and device strategies. This presentation takes a look at the Netflix and NPR strategies that led to tremendous growth and discusses how Netflix plans to take this internal API strategy to the next level.
Scaling the Netflix API - From Atlassian Dev DenDaniel Jacobson
The term "scale" for engineering often is used to discuss systems and their ability to grow with the needs of its users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
Talk about the Netflix API and how it serves as the front door for Netflix device UIs. Topics include: API design, resiliency patterns, scalability, and enabling fast dev/deploy cycles.
Techniques for Scaling the Netflix API - QCon SFDaniel Jacobson
This presentation was from QCon SF 2011. In these slides I discuss various techniques that we use to scale the API. I also discuss in more detail our effort around redesigning the API.
This is a presentation that I gave to ESPN's Digital Media team about the trajectory of the Netflix API. I also discussed Netflix's device implementation strategy and how it enables rapid development and robust A/B testing.
I gave this presentation to the engineering team at PayPal. This presentation discusses the history and future of the Netflix API. It also goes into API design principles as well as concepts behind system scalability and resiliency.
Immutable Infrastructure: Rise of the Machine ImagesC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1WlpXHF.
Axel Fontaine looks at what Immutable Infrastructure is and how it affects scaling, logging, sessions, configuration, service discovery and more. He also looks at how containers and machine images compare and why some things people took for granted may not be necessary anymore. Filmed at qconlondon.com.
Axel Fontaine is the founder and CEO of Boxfuse. Axel is also the creator and project lead of Flyway, the open source tool that makes database migration easy. He is a Continuous Delivery and Immutable Infrastructure expert, a Java Champion, a JavaOne Rockstar and a regular speaker at various large international conferences.
This is my presentation from the Business of APIs Conference in SF, held by Mashery (http://www.apiconference.com).
This talk briefly covers the history of the Netflix API, then goes into three main categories of scaling:
1. Using the cloud to scale in size and internationally
2. Using Webkit to scale application development in parallel to the flexibility afforded by the API
3. Redesigning the API to improve performance and to downscale the infrastructure as the system scales
When viewing these slides, please note that they are almost entirely image-based, so I have added notes for each slide to detail the talking points.
AWS re:Invent 2016: Getting Started with Serverless Architectures (CMP211)Amazon Web Services
Serverless architectures let you build and deploy applications and services with infrastructure resources that require zero administration. In the past, you had to provision and scale servers to run your application code, install and operate distributed databases, and build and run custom software to handle API requests. Now, AWS provides a stack of scalable, fully-managed services that eliminates these operational complexities.
In this session, you learn about the concepts and benefits of serverless architectures and the basics of the serverless stack AWS provides (e.g., AWS Lambda and Amazon API Gateway). We discuss use cases such as data processing, website backends, serverless applications and "operational glue". After that, you get practical tips and tricks, best practices, and architecture patterns that you can take back and implement immediately.
Pros and Cons of a MicroServices Architecture talk at AWS ReInventSudhir Tonse
Netflix morphed from a private datacenter based monolithic application into a cloud based Microservices architecture. This talk highlights the pros and cons of building software applications as suites of independently deployable services, as well as practical approaches for overcoming challenges - especially in the context of an elastic but ephemeral cloud ecosystem. What were the lessons learned while building and managing these services? What are the best practices and anti-patterns?
Yazid Boutejder: AWS San Francisco Startup Day, 9/7/17
Operations: Production Readiness Review – how to stop bad things from happening - There is more to deploying code than pushing the deploy button. A good practice that many companies follow is a Production Readiness Review (PRR) which is essentially a pre-flight check list before a service launches. This helps ensure new services are properly architected, monitored, secured, and more. We’ll walk through an example PRR and discuss the value of ensuring each of these is properly taken care of before your service launches.
Starting Your DevOps Journey – Practical Tips for OpsDynatrace
To watch, please see:
https://info.dynatrace.com/apm_wc_getting_started_with_devops_na_registration.html
Starting Your DevOps Journey: Practical Tips for Ops
In this webinar, Andreas Grabner, Chief DevOps Activist at Dynatrace, shares practical tips that all IT groups from Dev to Ops can use to start their DevOps journey quickly. With experience from hundreds of DevOps deployments, Andi provides insights it would take your team months or years to learn firsthand.
- Learn how everyone on your Ops team can use APM to better understand and monitor SLAs, Performance and End User Impact of their applications.
- Foster better collaboration between Ops and architects by extending basic system monitoring to monolith and microservices architectures.
- Shift-left your testing and QA by working with metrics that you and the architects agreed on up front, resulting in early relevant feedback and faster code deployments.
- Hear why changing the cultural mindset from “fear of change” to “Continuous Innovation and Optimization” is critical for success.
Andi is joined by guest speaker, Brian Chandler, Systems Engineer at Raymond James, who shares commonly used Ops dashboards that increase collaboration across IT teams and pro-actively break down silos!
Petabytes of Data & No Servers: Corteva Scales DNA Analysis to Meet Increasin...Amazon Web Services
Corteva Agriscience, the agricultural division of DowDuPont, produces as much DNA sequence data every six hours as existed in the entire public sphere in 2008. On-premises processing and storage could not scale to meet the business demand. Partnering with Sogeti (part of Capgemini), Corteva replatformed their existing Hadoop-based genome processing systems into AWS using a serverless, cloud-native architecture. In this session, learn how Corteva Agriscience met current and future data processing demands without maintaining any long-running servers by using AWS Lambda, Amazon S3, Amazon API Gateway, Amazon EMR, AWS Glue, AWS Batch, and more. This session is brought to you by AWS partner, Capgemini America.
Christian's part of the AWS re:Invent 2015 talk shared with Sajee Mathew - ARC304 - Designing for SaaS: Next Generation Software Delivery Models on AWS. Full video of the 60 minute presentation: https://www.youtube.com/watch?v=d16aUztH9hk&list=PLhr1KZpdzukdRxs_pGJm-qSy5LayL6W_Y
Similar to Maintaining the Front Door to Netflix : The Netflix API (20)
Many API programs get launched without a clear understanding as to WHY the API should exist. Rather, many are focused on WHAT the API consists of and HOW it should be targeted, implemented and leveraged. This presentation focuses on establishing the need for a clear WHY proposition behind the decision. The HOW and then WHAT will follow from that.
This presentation also uses the history of the Netflix API to demonstrate the power, utility and importance of knowing WHY you are building an API.
Netflix API: Keynote at Disney Tech ConferenceDaniel Jacobson
Disney held the first in a series of internal technical conferences in Orlando, FL, this one focused entirely on APIs. These slides are from my keynote presentation which kicked off the event. The slides focus on the Netflix API, API design, anti-patterns, technical revolutions, resiliency, scaling, test frameworks and other constructs that support the Netflix infrastructure.
History and Future of the Netflix API - Mashery Evolution of DistributionDaniel Jacobson
Presentation on the history and future of the Netflix API. This presentation walks through how the API was formed, why it needs a redesign and some of the principles that will be applied in the redesign effort.
This presentation was given at the Mashery Evolution of Distribution session in San Francisco on June 2, 2011.
This presentation demonstrates the great successes of the Netflix API to date. After some introspection, however, there is an opportunity to better prepare the API for the future. This presentation also offers a few ideas on how the Netflix API architecture may change over time.
NPR: Digital Distribution Strategy: OSCON2010Daniel Jacobson
When launching the API at OSCON in 2008, NPR targeted four audiences: the open source community; NPR member stations; NPR partners and vendors; and finally our internal developers and product managers. In its short two-year life, the NPR API has grown tremendously, from only a few hundred thousand requests per month to more than 60M. The API, furthermore, has enabled tremendous growth for NPR in the mobile space while facilitating more than 100% growth in total page views in the last year.
NPR's Digital Distribution and Mobile StrategyDaniel Jacobson
The NPR API has been the great enabler to achieve rapid development in the mobile space. That is, because we have our rich and powerful API, our mobile team is free to pursue the development of their mobile products without being encumbered by limited internal development resources. The touch-point between the mobile product and our content is fixed which means the mobile team can focus on design and usability for the specific platform.
These slides demonstrate some of the usage and metrics of the NPR API. In addition to the flow of an NPR story from creation to distribution, I also tried to provide a reasonable sampling of the more popular or interesting implementations.
These slides are from the OpenID UX Summit at Sears in Chicago. We discuss the newly formed Adoption Committee for OpenID, NPR's identity sharing strategy, Sears' OpenID case study, PBS' case study, and the goal towards a federated public media identity.
This presentation shows the same NPR story displayed in a wide range of platforms. The content, through the principles of COPE, is pushed out to all of these destinations through the NPR API. Each destination, meanwhile, uses the appropriate content for that presentation layer.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides from my and Rik Marselis's talk at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop with the participants, exploring different ways to think about quality and testing in the various parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The keynote covers the key trends across hardware, cloud and open source, exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Maintaining the Front Door to Netflix : The Netflix API
1. Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
2. There are copious notes attached to each slide in this presentation.
Please read those notes to get the full context of the presentation.
4. More than 44 Million Subscribers
More than 40 Countries
5. Netflix Accounts for ~33% of Peak
Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
8. Team Focus:
Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming
9. Key Responsibilities
• Broker data between services and UIs
• Maintain a resilient front-door
• Scale the system vertically and horizontally
• Maintain high velocity
60. Amazon Auto Scaling Limitations
• Hard to fit policies to variable traffic patterns (weekday vs. weekend)
• Limited control over capacity adjustments (absolute value or %)
61. The Impact of AAS Limitations
• Traffic drop can lead to scale-downs during an outage
• Performance degradation between new instance launch and taking traffic
• Excess capacity at peak and trough
66. What is Scryer Doing?
• Evaluating needs based on historical data – week-over-week, month-over-month metrics
• Adjusts instance minimums based on algorithms
• Relies on Amazon Auto Scaling for unpredicted events
99. Single Canary Instance
To Test New Code with Production Traffic (around 1% or less of traffic)
Current Code In Production
API Requests from the Internet
101. Single Canary Instance
To Test New Code with Production Traffic (around 1% or less of traffic)
Current Code In Production
API Requests from the Internet
Error!
138. Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Editor's Notes
Netflix strives to be the global streaming video leader for TV shows and movies
We now have more than 44 million global subscribers in more than 40 countries
Those subscribers consume more than a billion hours of streaming video a month which accounts for about 33% of the peak Internet traffic in the US.
Our 44 million Netflix subscribers are watching shows and movies on virtually any device that has a streaming video screen. We are now on more than 1,000 different device types.
The subscribers can watch our original shows like Emmy-winning House of Cards.
Within this world, the Edge Engineering team focuses on these three aspects of the streaming product.
Before all of this streaming success, however, our roots were in supporting the DVD business.
The system that ran that business was a large monolithic application that was developed and maintained by many teams.
And that monolithic application ran in data centers that we needed to scale and maintain directly.
And as we all know, the bigger the ship, the slower it turns. That was very much the case for Netflix years ago.
To grow to where we knew we needed to be, Netflix aggressively moved to a distributed architecture.
I like to think of this distributed architecture as being shaped like an hourglass…
In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
The API is at the center of the hourglass, acting as a broker of data.
Our distributed architecture, with the number of systems involved, can get quite complicated. Each of these systems talks to a large number of other systems within our architecture.
Assuming each of the services has an SLA of four nines, that results in more than two hours of downtime per month.
And that is if all services maintain four nines!
If availability degrades to three nines, that is almost one day of downtime per month!
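The arithmetic behind these numbers can be sketched as follows. The count of 30 dependencies is an illustrative assumption (the deck only says "several dozen" dependency teams), not a stated Netflix figure:

```python
# If an API fans out to N dependencies, its best-case availability is the
# product of theirs. N = 30 is an assumption, chosen to match "several dozen".

HOURS_PER_MONTH = 730  # average month


def downtime_hours_per_month(per_service_availability: float, n_services: int) -> float:
    """Compound availability across n serial dependencies, expressed as downtime."""
    overall = per_service_availability ** n_services
    return (1 - overall) * HOURS_PER_MONTH


four_nines = downtime_hours_per_month(0.9999, 30)   # ~2.2 hours/month
three_nines = downtime_hours_per_month(0.999, 30)   # ~21.6 hours/month, almost a day
print(f"four nines across 30 services: {four_nines:.1f} h/month")
print(f"three nines across 30 services: {three_nines:.1f} h/month")
```

With these assumptions, four nines per service compounds to roughly 2.2 hours of monthly downtime, and three nines to almost a full day, matching the slide's claims.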
So, back to the hourglass…
In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
Such a failure could have resulted in an outage in the API.
And that outage likely would have cascaded to have some kind of substantive impact on the devices.
The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems and to keep them happy.
To solve this problem, we created Hystrix, a wrapping technology that provides fault tolerance in a distributed environment. Hystrix is also open source and available in our GitHub repository.
To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
This is a view of a single circuit.
This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
The blue line represents the traffic trends over the last two minutes for this dependency.
The green number shows the number of successful calls to this dependency over the last two minutes.
The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
The orange number shows the number of calls that have timed out, resulting in fallback responses.
The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
The red number shows the number of exceptions, resulting in fallback responses.
The error rate is calculated from the total number of error and fallback responses divided by the total number of calls handled.
If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
The dashboard also shows host and cluster information for the dependency.
As well as information about our SLAs.
So, going back to the engineering diagram…
If that same service fails today…
We simply disconnect from that service.
And replace it with an appropriate fallback. The fallback is ideally a slightly degraded but still useful offering. If we cannot provide that, however, we will quickly return a 5xx response, which helps the system shed load rather than queue requests (which could eventually cause the system as a whole to tip over).
This will keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
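The circuit-breaker-plus-fallback behavior described above can be sketched in a few lines. This is the idea behind Hystrix, not Hystrix itself; the threshold, window size, and names are illustrative:

```python
# Track the error rate over a rolling window; past a threshold, open the
# circuit and short-circuit straight to the fallback without touching the
# dependency.
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, window_size=100):
        self.error_threshold = error_threshold
        self.outcomes = deque(maxlen=window_size)  # True = error/fallback

    def error_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def call(self, dependency, fallback):
        if self.error_rate() > self.error_threshold:
            return fallback()           # circuit open: don't touch the dependency
        try:
            result = dependency()
            self.outcomes.append(False)
            return result
        except Exception:
            self.outcomes.append(True)  # counts toward the error rate
            return fallback()


def flaky_dependency():
    raise RuntimeError("dependency down")


def degraded_but_useful():
    return "fallback response"


breaker = CircuitBreaker()
for _ in range(10):
    breaker.call(flaky_dependency, degraded_but_useful)
print(breaker.error_rate())  # 1.0 -- the circuit is open, callers get fallbacks
```

A real breaker such as Hystrix also re-closes the circuit once the error rate recovers, typically by letting occasional probe requests through (a half-open state); that is omitted here for brevity.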
In addition to the migration to a distributed architecture, we also aggressively moved out of data centers…
And into the cloud.
Instead of spending time in data centers, we spend our time in tools such as Asgard, created by Netflix staff, to help us manage our instance types and counts in AWS. Asgard is available in our open source repository on GitHub.
Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
Through autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix).
Instead of reacting to real-time metrics, like load average, to increase/decrease the instance count, we can look at historical patterns in our traffic to figure out what will be needed BEFORE it is needed. We believed we could write algorithms to predict the needs.
This is the result of the algorithms we created for the predictions. The prediction closely matches the actual traffic.
Based on those predictions, we started triggering scaling events. Those scaling events closely matched the traffic patterns as well, although these scaling events (as opposed to the Amazon auto-scaler) preceded the need.
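The predict-then-provision idea can be sketched as below. Scryer is not open source, so this is not its algorithm; averaging the same hour-of-week across recent weeks, the requests-per-instance figure, and the headroom factor are all illustrative assumptions:

```python
# Predict each upcoming hour's traffic from the same hour in previous weeks,
# then set the instance-count minimum ahead of need. Reactive autoscaling
# (Amazon Auto Scaling) still covers unpredicted spikes on top.
import math


def predict_rps(history_weeks: list, hour: int) -> float:
    """Average the same hour-of-week across recent weeks."""
    samples = [week[hour] for week in history_weeks]
    return sum(samples) / len(samples)


def min_instances(predicted_rps: float, rps_per_instance: float = 1000.0,
                  headroom: float = 1.2) -> int:
    """Provision for the prediction plus headroom."""
    return math.ceil(predicted_rps * headroom / rps_per_instance)


# Three weeks of hourly RPS for the same hour (e.g. Monday 20:00):
history = [[52_000.0], [55_000.0], [58_000.0]]
pred = predict_rps(history, hour=0)        # 55,000 rps
print(min_instances(pred))                 # ceil(55,000 * 1.2 / 1,000) = 66
```

The key property is the one the notes describe: the minimum is raised *before* the predicted traffic arrives, rather than in reaction to load averages after the fact.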
Load average when running Scryer is much smoother.
Smoother load average results in more consistent and faster response times.
Another benefit is how Scryer protects us from outage recovery.
Retries and thundering herd effects are mitigated by consistent provisioning of instances through Scryer.
Meanwhile, because we have more predictable scaling needs that can be provisioned more granularly, our AWS costs go down.
Going global has a different set of scaling challenges. AWS enables us to add instances in new regions that are closer to our customers.
To help us manage our traffic across regions, as well as within given regions, we created Zuul. Zuul is open source in our github repository.
Zuul does a variety of things for us. Zuul fronts our entire streaming application as well as a range of other services within our system.
Moreover, Zuul is the routing engine that we use for Isthmus, which is designed to marshal traffic between regions, for failover, performance, or other reasons.
We also leverage multiple regions to help us fail over one region to another, through an effort we call Active-Active.
Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
The army is the Simian Army, which is a fleet of monkeys who are designed to do a variety of things, in an automated way, in our cloud implementation. Chaos Monkey, for example, periodically terminates AWS instances in production to see how the system as a whole will respond once that server disappears. Latency Monkey introduces latencies and errors into a system to see how it responds. The system is too complex to know how things will respond in various circumstances, so the monkeys expose that information to us in a variety of ways. The monkeys are also available in our open source github repository.
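The core Chaos Monkey move can be sketched as below. The `Cluster` class and `terminate()` call are hypothetical stand-ins for illustration, not the Simian Army's actual API:

```python
# On a schedule, pick one instance per cluster at random and terminate it,
# verifying the system as a whole tolerates the loss.
import random


class Cluster:
    def __init__(self, name, instances):
        self.name = name
        self.instances = list(instances)

    def terminate(self, instance):
        # A real implementation would issue an AWS instance-termination call.
        self.instances.remove(instance)


def chaos_monkey(cluster, rng=random):
    """Terminate one randomly chosen instance, but never the last one."""
    if len(cluster.instances) <= 1:
        return None
    victim = rng.choice(cluster.instances)
    cluster.terminate(victim)
    return victim


api = Cluster("api-prod", ["i-01", "i-02", "i-03", "i-04"])
victim = chaos_monkey(api)
print(f"terminated {victim}; survivors: {api.instances}")
```

Latency Monkey works on the same principle but injects delays and errors into calls rather than killing instances.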
Again, the dependency chains in our system are quite complicated.
That is a lot of change in the system!
As a result, our philosophy is to act fast (i.e., get code into production as quickly as possible), then react fast (i.e., respond to issues quickly as they arise).
Two such examples are canary deployments and what we call red/black deployments.
The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
The health of the canary is automated as well, comparing its metrics against the fleet of production servers.
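That automated comparison can be sketched as follows. This is a hypothetical illustration of the idea, not Netflix's actual canary-analysis tooling; the metric names and the tolerance value are made up.

```python
def canary_healthy(canary_metrics, fleet_baseline, tolerance=0.2):
    """Flag the canary unhealthy if any metric deviates from the
    fleet average by more than `tolerance` (relative deviation).

    canary_metrics: dict of metric name -> canary's observed value
    fleet_baseline: dict of metric name -> fleet-wide average
    """
    for metric, base_value in fleet_baseline.items():
        if base_value == 0:
            continue  # avoid dividing by zero for absent baselines
        deviation = abs(canary_metrics[metric] - base_value) / base_value
        if deviation > tolerance:
            return False
    return True
```

A real system would aggregate many such comparisons into a score over time, but the essence is the same: the fleet defines "normal," and the canary is judged against it.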
If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc.
We will then repeat the process until the analysis of the canary servers looks good.
We also use Zuul to funnel varying degrees of traffic to the canaries to evaluate how much load the canary can take relative to the current production instances. If the RPS, for example, drops, the canary may fail the Zuul stress test.
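Funneling a percentage of traffic to the canary amounts to a weighted routing decision. The sketch below is a hypothetical simplification, not an actual Zuul filter; the pool names and the percentage-based rule are illustrative.

```python
import random

def route(canary_percent, rng=random.random):
    """Send roughly `canary_percent` percent of requests to the canary
    pool and the rest to the production pool."""
    return "canary" if rng() * 100 < canary_percent else "prod"
```

Dialing `canary_percent` up gradually is what lets us observe how the canary behaves as load approaches production levels.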
If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
Then switch the pointer to have external requests point to the black servers. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
If a problem is encountered from the black servers, it is easy to rollback quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
And we will stress the canary again…
If the new code looks good in the canary, we can then bring up another set of servers with the new code.
Then we will switch production traffic to the new code.
If everything still looks good, we disable the red servers and the new code becomes the new red servers.
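The red/black mechanism boils down to a traffic pointer that can be switched and, crucially, switched back. This is a minimal sketch of that idea, with hypothetical cluster names; the real switch would happen at the load-balancer or DNS level.

```python
class TrafficPointer:
    """Points external traffic at exactly one cluster at a time.
    Rollback is just pointing back at the previous cluster."""

    def __init__(self, active):
        self.active = active      # cluster currently serving traffic
        self.previous = None      # cluster we can roll back to

    def switch(self, new_cluster):
        """Cut traffic over to a freshly launched cluster."""
        self.previous = self.active
        self.active = new_cluster

    def rollback(self):
        """Undo the last switch, e.g. when the new cluster misbehaves."""
        if self.previous is not None:
            self.active, self.previous = self.previous, self.active
```

Because the old cluster stays up until the new one is proven, rollback is a pointer flip rather than a redeploy, which is what makes reacting fast possible.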
Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis?
Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
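One way to picture per-device optimization is a response shaped by a device profile, so a phone receives fewer titles and fewer fields than a TV. Everything here is hypothetical for illustration: the profiles, limits, and field names are not Netflix's actual schema.

```python
# Hypothetical device profiles: how many titles to send, and which
# metadata fields each device can actually use.
DEVICE_PROFILES = {
    "tv":    {"max_titles": 75, "fields": {"title", "synopsis", "cast", "boxart_hd"}},
    "phone": {"max_titles": 20, "fields": {"title", "boxart_sd"}},
}

def shape_response(titles, device):
    """Trim a list of title records down to what this device needs,
    instead of shipping everything and letting the device drop it."""
    profile = DEVICE_PROFILES[device]
    return [
        {k: v for k, v in t.items() if k in profile["fields"]}
        for t in titles[:profile["max_titles"]]
    ]
```

The alternative, sending every field to every device, wastes bandwidth on constrained devices and forces the client to discard data it could never display.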
Many UI teams needing metadata means many requests to the API team. In the one-size-fits-all API world, we essentially needed to funnel these requests and then prioritize them. That means that some teams would need to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints, resulting in a more complex system as well as a lot of spaghetti code. Making teams wait due to prioritization was exacerbated by the fact that tasks took longer as technical debt increased, causing build and test times to grow. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
Many other companies have seen similar issues and have introduced orchestration layers that enable more flexible interaction models.
OData, HYQL, ql.io, rest.li and others are examples of orchestration layers. They address the same problems that we have seen, but we approached the solution in a very different way.
We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
In this mode, there is a very clear divide between the Client Code and the Server Code. That divide is the network border.
And the responsibilities have the same distribution as well. The Client Code handles the rendering of the interface (as well as asking the server for data). The Server Code is responsible for gathering, formatting and delivering the data to the UIs.
And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
In an experience-based interaction, the PS3 can potentially make a single request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script, which then formats and delivers the very specific data back to the PS3.
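The shape of that interaction can be sketched as follows, with Python standing in for both the Groovy adapter and the Java API. The function names, the resources, and the response structure are all hypothetical illustrations, not the actual Netflix interfaces.

```python
def java_api(resource):
    """Stand-in for the shared Java API methods that gather data from
    dependent services (here just canned data for illustration)."""
    data = {
        "queue": ["Title A", "Title B"],
        "recommendations": ["Title C"],
    }
    return data[resource]

def ps3_home_endpoint():
    """Stand-in for a device-specific Groovy script: one endpoint that
    gathers everything the PS3 home screen needs, then formats the
    payload exactly as that device wants it."""
    return {
        "rows": [
            {"name": "Your Queue", "titles": java_api("queue")},
            {"name": "Suggestions", "titles": java_api("recommendations")},
        ]
    }
```

The key property is that the device makes one request and the per-device assembly logic lives server-side, written by the client team, next to the shared data-gathering methods.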
We also introduced RxJava into this layer to improve our ability to handle concurrency and callbacks. RxJava is open source in our github repository.
In this model, the border between Client Code and Server Code is no longer the network border. It is now back on the server. The Groovy scripts are essentially client adapters written by the client teams.
And the distribution of work changes as well. The client teams continue to handle UI rendering, but now are also responsible for the formatting and delivery of content. The API team, in terms of the data side of things, is responsible for the data gathering and hand-off to the client adapters. Of course, the API team does many other things, including resiliency, scaling, dependency interactions, etc. This model is essentially a platform for API development.
If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
All of the open source components discussed here, as well as many others, can be found at the Netflix github repository.