A Cloud Gateway -
A Large Scale Company’s First Line
of Defense
Mikey Cohen
Manager - Edge Gateway
Netflix
Today, more than 36% of
North America’s internet
traffic is controlled by
systems in the Amazon
Cloud
Global Streaming of TV Shows and
Movies
Nearly 70 Million Subscribers
In over 80 Countries
Netflix accounts for over 36% of
Downstream Traffic in North
America
From the Internet to Services in the Cloud
Gateway
Gateway
?????
Origin (API)
Origin (API)
API
Origin (API)
Origin (API)
Website
Our Edge Gateway @ Netflix
Handles most netflix.com hosts
Over 20 production Zuul clusters
~ 50 elbs
Gateway handles ~10 origin services
Netflix Gateway Scale
Tens of billions of requests per day
3 AWS regions
Over 1000 device types
Hundreds of permutations of protocols and
device versions
Success
Evolution
Scale
Failure
Our Journey
So What!? - Change your perspective!!
Traditional Cloud Proxy Mission
Simple static rule-based routing
API portal
Request authentication
Throttling - request caps
Monitoring
Caching
The Gateway - a grown-up proxy!
●Dynamic routing
●Deep Insights
●Load balancing
●Availability focused
●Service protection
●Quality assurance tool
Evolving to a Gateway
Netflix’s Public API
Late 2008
Mashery
Datacenter
Streaming Devices using public API
Early Streaming Devices - 2009
Windows Media Center
XBox
PS3
Migration to AWS
2010
Sonoa / Apigee proxy
Device traffic, not public
Controlling DC -> cloud
migration
Running in AWS
Under Netflix control
Streaming Success
2011
Chaos
Complexity
Failure
Success
Leveraging
Cloud benefits
Anti-patterns of most cloud proxies
Static configurations
Service push needed to
change behavior
Limited range of
functionality
Limited to HTTP
Zuul Created
2012
Dynamically injected and compiled filters
Manipulate requests and responses
Headers / Body / etc
Change routing
Add metrics and other functions
Built on Netflix’s OSS stack
Open Sourced
Zuul - A Victim of Success
Easy and convenient
Instant results
High adoption
Happy customers
Business logic in proxy
Affects system resiliency
Zuul team in critical path
Creating a Gateway
Strategy
Principles of Netflix’s Gateway Strategy
Creative Routing
Dynamic Routing
Delivery Focused
Traffic Shaping
React Fast
Insights
Creative Routing - Subclusters with Purpose
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
Red / Green Deployments
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
Instrumented
squeeze
squeeze
Developer Test Branches
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
Instrumented
squeeze
squeeze
Instrumented Clusters
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
squeeze
Squeeze Testing
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
Targeted Routing
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debu
g
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
Service “Canarying”
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
squeeze
“Sticky” Canary
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
canary
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
squeeze
Failure Injection Testing
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
squeeze
Degraded Experience Testing
Gateway
Gateway
Gateway
Origin (API)
v1
v2
test
debug
Instrumented
squeeze
“sticky”
canarybaseline
“sticky”
baseline
v1
v2
test
debug
baseline canary
“sticky”
canary
“sticky”
baselineFIT
Instrumented
squeeze
squeeze
Traffic Shaping
A Global Cloud Deployment
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Global Cloud Routing
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
A Failing region
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Gateway routing to other regions
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Persistence Tier
Business
services Tier
Presentation
Tier
Network Tier
Websites
API
Proxy
DB
Attack prevention
Gateway
Gateway
Gateway
Origin (API)
Origin (API)
API
Origin (API)
Origin (API)
Website
Smart Load Balancing
Gateway
Gateway
Gateway
Origin (API)
Smart Load Balancing - Bad Nodes
Gateway
Gateway
Gateway
Origin (API)
Gateway Backoff and Blacklists Bad Nodes
Gateway
Gateway
Gateway
Origin (API)
Zone Failure - Blacklist the Zone automatically
Gateway
Gateway
Gateway
Origin (API)
React Quickly - Runtime Filter changes
Gateway
Gateway
Gateway
Origin (API)
Origin (API)
API
Origin (API)
Origin (API)
Website
Runtime Policy
Injection
A Room with a View - Insights
Gateway
Gateway
Gateway
Origin (API)
Origin (API)
API
Origin (API)
Origin (API)
Website
Insights
What’s Next for Netflix’s Gateway?
Gateway as a service
Self-service dynamic routing / route validation
Control APIs for special routing functions
Netty Based Zuul (using RxNetty)
Handling persistent connections
non-blocking, async
Transport protocol agnostic routing
Reactive Socket http://reactivesocket.io/
Top Ten Lessons Learned
Build for handling
Failures
Expect the Unexpected
Using Routing Creatively
Shard to Reduce Blast
Radius
Devices are Weird
Protocols are Weird
Devices are Forever
Protocols are Forever
It will be built “wrong”
Keep Business Logic out
of your Gateway
For More Info...
Zuul OSS
Netflix Tech Blog
RxNetty
Jobs

Rethinking Cloud Proxies

Editor's Notes

  • #12 Our gateway strategy will change the way you think about resiliency, debugging, continuous delivery, service operations, and insights.
  • #19 Devices slow to update Need emergency policies Fast action
  • #20 Limited range of functionality Hard to program Authentication Authorization Static responses / Origin specific headers Why? Federation of logic across systems creates complexity Minimize gateway dependencies to maximize availability
  • #24  Origin services run many clusters Route to service clusters based on dynamic routing rules Shape or reject traffic based on service, regional health, or attack React fast in emergencies Realtime analytics and insights Ensures request delivery from internet to services running in the cloud Dynamically changes routing behaviors Routes to services Services have multiple clusters Clusters have dynamically changing nodes Bridges multiple cloud regions and data centers Provides system Insights
  • #25 Same service: Subclusters for many purposes Set up by filters in Zuul Self serviceable by cluster owners Automated Quality assurance / Test Automation Targeted debugging Test Automation Canary / Baseline A/B testing of service behavior per build Squeeze Testing Service capacity testing Trickle traffic Instrumented builds Sticky Canary A/B testing of client behavior per origin build
  • #28 Trickling traffic into clusters High Overhead profiling tools “Coalmine” verbose logging
  • #29 Server capacity testing Gateway gradually increases traffic until performance degradation is detected Automated or manual
  • #30 Isolate requests by customer, route, type of device, or any routing rule Debug node(s) are often instrumented to give verbose logging Custom Request Routing
  • #31 Compare server behavior and metrics Equal traffic rates hit both clusters Automated part of production push process Error rates CPU for equivalent work Automated metrics analysis returns a score of how well the canary cluster performed A poor score stops the push process
  • #32 Servers may be healthy data may be bad API changes that affect devices Data changes certain devices can’t interpret Protocol and transport changes that some devices can’t accept Testing 1000’s of types of devices would be a time consuming, tedious process. Sticky Canary idea - Stick all requests for a small subset of customers for a limited time to a “sticky canary” or “sticky baseline” If servers are equivalent, there should be no behavioral differences. Insights can help find these anomalies Limited scope of impact - a very small subset of customers could be affected but only for a short period of time
  • #37 Reroute to the closer region to the client - DNS accuracy issues, etc Reroute due to region failure.
  • #40 Speedbump Dynamic DDOS prevention