Meet the Women Living on the Edge
Gaya Varadarajan, Senior Software Engineer, Cloud Gateway
Kim Trott, Engineering Director, Edge Device Services
Karen Casella, Engineering Leader, Edge & Playback Access
Haripriya Murthy, Senior Software Engineer, Playback Licensing
Sangeeta Narayanan, Engineering Director, Edge Dev Experience
Daniela Enyedi, Senior Software Engineer, DNA API
Platform as a Service
Allows engineers “...to develop, run, and manage applications without the complexity of building and maintaining infrastructure…”
- Wikipedia
Edge provides functionality and metadata to power core Netflix product experiences
Zuul: Internet-facing service that provides routing, traffic shaping, security, and more
NQ: Translation layer to optimize experience for each device
API: Orchestration and abstraction over other Netflix services
PBL: Handles content licensing
EDX: Tools and infrastructure to enable engineers to develop and operate complex systems at scale
Handles traffic from Netflix customers all over the world, supporting thousands of device types and proxying tens of billions of requests per day
Zuul is the front door to Netflix’s server infrastructure
Zuul handles traffic steering, routing, and insights into Netflix’s cloud systems
Example: a TV scenario in which a bad config change went out, and Zuul scaled well
Cross Region Resiliency when a backend gets into trouble in one region.
Zuul offers a lot of cool features
But I am going to focus on self-service routing
At the beginning, only a few requests were coming in for route changes
Getting Gateway out of the way for route changes
Self-service UI
Assign her primary route
Ramp a new backend slowly to productionize it
Override a small % of traffic to a single-instance cluster for debugging
Squeeze some traffic to establish a benchmark of CPU to RPS
Configure security rules to reject spam
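The self-service routing actions above (a primary route plus small percentage overrides for ramping or debugging) can be sketched roughly as follows. This is a minimal illustration in Python, not Zuul's actual configuration model: the `Route` class, its fields, and the cluster names are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Route:
    """Hypothetical route: a primary cluster plus percentage overrides."""
    primary: str
    # Each override is (percent_of_traffic, cluster_name); hypothetical shape.
    overrides: list = field(default_factory=list)

    def pick(self, roll: float) -> str:
        """Return the cluster for a roll in [0, 100)."""
        threshold = 0.0
        for percent, cluster in self.overrides:
            threshold += percent
            if roll < threshold:
                return cluster
        # Everything not claimed by an override goes to the primary route.
        return self.primary


def choose_cluster(route: Route) -> str:
    """Pick a cluster for one incoming request using a random roll."""
    return route.pick(random.random() * 100)


# 1% of traffic to a single-instance debug cluster, 5% to a new backend
# that is being ramped up, the rest to the primary route.
route = Route(
    primary="api-prod",
    overrides=[(1.0, "api-debug"), (5.0, "api-canary")],
)
```

Ramping a new backend then amounts to raising its percentage step by step, and ending a debugging session amounts to removing its override.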
My name is Kim and I’ve been at Netflix for over 10 years. I started out as a UI Engineer working on the website back when that was the only platform we had. Now I’m deep on the server infrastructure side.
Netflix is continuously innovating to deliver the best possible customer experience. You may be familiar with our TV UI.
And this is our iOS mobile experience.
While they’re both Netflix, they’re actually quite different. The form factors of TVs and mobile devices are different, as is the input method (touch vs. remote control). So are the size and orientation of the imagery and the metadata fields that make up the screen.
Powering the UI takes a lot of data (and personalization algorithms), but both of these UIs have very different data needs.
That’s why….
No, I’m not talking about best friends, though for some engineers this may be their best friend.
I’m talking about Backends for Frontends.
Sam Newman wrote about this pattern in 2015. It’s fun when the thing you’ve been doing for years gets a name!
Resources:
https://samnewman.io/patterns/architectural/bff/
https://nordicapis.com/building-a-backend-for-frontend-shim-for-your-microservices/
With a BFF, each UI team can have its own backend service that completely customizes the data for its UI/device application, giving the team complete control over the request-response lifecycle between the device and server.
Translation layer enables:
Customization, business logic: Get the data they need and send it back in the format that best suits that device.
Rapid iteration and A/B testing: Change what data is returned without needing to coordinate with other teams
Creates thinner, more focused services
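As a rough illustration of the translation layer, the sketch below shapes one shared record differently for a TV and a mobile client. It is written in Python for brevity; the record layout, field names, and URLs are hypothetical and not Netflix's actual data model.

```python
# Hypothetical record as it might come back from a shared backend API.
TITLE_RECORD = {
    "id": 42,
    "title": "Example Show",
    "synopsis": "A long synopsis suited to a large screen.",
    "maturity_rating": "TV-14",
    "art": {
        "landscape": "https://example.com/art/42-landscape.jpg",
        "portrait": "https://example.com/art/42-portrait.jpg",
    },
}


def tv_view(record: dict) -> dict:
    """TV BFF: wide artwork and richer metadata for a large screen."""
    return {
        "id": record["id"],
        "title": record["title"],
        "synopsis": record["synopsis"],
        "maturityRating": record["maturity_rating"],
        "art": record["art"]["landscape"],
    }


def mobile_view(record: dict) -> dict:
    """Mobile BFF: portrait artwork and a smaller payload."""
    return {
        "id": record["id"],
        "title": record["title"],
        "art": record["art"]["portrait"],
    }
```

Because each function belongs to one UI team, that team can change the fields it returns, for example for an A/B test, without touching the other device's payload.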
The core charter of our UI engineering team is to rapidly iterate on the user experience.
Edge provides a platform that enables device teams to rapidly and easily deploy services for their front-end application, without having to deal with the complexity of server infrastructure, high availability, fault tolerance, etc.
We provide the platform and manage the infrastructure. They bring their code / scripts.
We use Node.js as the technology because it is the best overall fit for UI teams.
Isolation:
Isolate failures
Independent root cause of issues
Independent autoscaling
Better regression / performance testing
If every BFF had to talk to the hundreds of microservices at Netflix, it would overwhelm our UI teams and prevent them from rapidly iterating on the user experience. That’s why we have the API Service Layer to aggregate and orchestrate all the mid-tier at Netflix and insulate and abstract that layer from UI Engineering teams. That’s what Karen Casella is going to talk about next.
API service layer
Traffic sharding to improve availability
Orchestration: Zuul -> NQ -> API. API owns the order of operations, fetches data from back-end systems, then aggregates and returns the data upstream.
Availability protection: the priority is to favor streaming over all other functionality, so API cannot go down entirely. It uses the Hystrix fault tolerance pattern as a library and handles errors with fallbacks, which may result in a degraded customer experience, but at least customers can stream. It also throttles.
Abstraction: shields upstream teams from downstream system knowledge and changes; provides APIs not offered by downstream services; caching; batch APIs.
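The fallback idea behind availability protection can be sketched as below. This is a simplified Python illustration of the general pattern, not the Hystrix API: the `FallbackCommand` class and its parameters are hypothetical, and it omits features a real library offers, such as timeouts and letting a tripped circuit recover.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class FallbackCommand(Generic[T]):
    """Wrap a downstream call with a fallback and a simple failure counter."""

    def __init__(
        self,
        run: Callable[[], T],
        fallback: Callable[[], T],
        failure_threshold: int = 3,
    ) -> None:
        self.run = run
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures seen so far

    def execute(self) -> T:
        # Circuit open: stop calling the struggling dependency and
        # serve the degraded response straight away.
        if self.failures >= self.failure_threshold:
            return self.fallback()
        try:
            result = self.run()
        except Exception:
            self.failures += 1
            return self.fallback()
        self.failures = 0  # a success resets the counter
        return result


def fetch_personalized_rows() -> list:
    """Hypothetical downstream call that is currently failing."""
    raise RuntimeError("recommendations service unavailable")


rows = FallbackCommand(
    run=fetch_personalized_rows,
    fallback=lambda: ["Popular on Netflix"],  # generic, unpersonalized row
).execute()
```

The customer sees a less personalized screen instead of an error, which matches the stated priority of keeping streaming available even when other functionality degrades.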
Monolithic architecture challenges
Observability
Reduce time to detect & debug issues
Flamegraphs have too much information
Time to Resolve
Monolith, high start-up times
Inhibits fast releases / rollbacks
Image credits
https://www.gannett-cdn.com/-mm-/c7c72be3b5ba5526bd2a95f450ca45139f4b704f/c=0-79-1483-1191&r=x404&c=534x401/local/-/media/2015/08/06/Indianapolis/B9318361924Z.1_20150806155035_000_GPCBIAR98.1-0.png
https://www.netbraintech.com/wp-content/uploads/2017/07/saas-mttr-300x187.png
Blast Radius
Isolation in failure scenarios
Bad code push / downstream service unavailable
Hundreds of changes in a 24-hour period. How do we sustain this velocity?
From idea to release in the shortest amount of time
Reducing friction in the development process enables our engineers to move fast.
Moving fast is risky at our scale and complexity. Failure is inevitable, so we embrace it and focus on minimizing the time to detect, recover from, and root-cause issues.
Our observability suite includes various capabilities such as the ones listed here.
Granular insights into system behavior - per device, per request, in near real time
http://bit.ly/2Dsbjsz
Our business has seen impressive growth over the past few years. We work hard to ensure our systems can scale to support this type of growth.
But it is not enough that we scale our systems. Scaling ourselves is equally important.
An emerging area of focus is the human factors involved in operating in an environment such as ours. We believe that is key to maintaining the balance between velocity and reliability.
Join the women of Edge Eng and our allies for dinner