Mohamed Elmergawi presented on creating a highly available persistent session management service using Redis and a connection pooling proxy. The presentation outlined the challenges of Zulily's legacy architecture and proposed a new architecture using a connection pooling proxy, session service with Redis and Dynomite for real-time replication across multiple regions. This new architecture improved availability, reduced overhead from establishing new connections and leveraged existing connections efficiently. Testing showed the new system could recover from outages within 250ms with a failure rate of only 0.42%.
Redis presentation
1. PRESENTED BY
Creating a Highly Available Persistent
Session Management Service with Redis
and a Connection Pooling Proxy
Lead Software Engineer, Zulily
Mohamed Elmergawi
2. ZULILY'S BUSINESS CREATES INTERESTING TECHNICAL CHALLENGES
• A new store every day: thousands of products at brag-worthy prices
• Inspired, discovery-driven experience: without specific purchase intent
• Highly curated sales events: 100+ time-limited sales (72 hours)
• A daily destination: 75% of orders via mobile (Q3 2019)
• Massively personalized approach: launch millions of versions of the site/app daily
• Global marketplace: 15,000+ vendors including Under Armour, Cuisinart, Melissa & Doug
3. Problem Definition
A reliable global session service is critical:
• If it goes down, you can't serve customers
• Infrastructure is volatile; we need persistence
• Speed is key
"Everything fails all the time" - Werner Vogels, CTO, Amazon
4. Legacy Architecture
• No HA: a hardware or network degradation leads to a failure
• Sharding logic is coupled at the application level
• Requires manual intervention to promote a slave to master
• Limits global expansion
• Idle slave nodes
[Diagram: the site cluster, app cluster, and a REST API each go through Twemproxy, which reads/writes to a master node; each master replicates asynchronously to an idle slave node.]
5. Alternative Approaches
Redis Cluster
• Not suited for applications that require availability in the event of large net splits
• Active/passive mode
Redis Sentinel
• The sharding logic would still be coupled with the application
• Active/passive mode
6. New Architecture
• Connection Pooling Proxy
• Session Service
• Real-Time Replication
[Diagram: in each of regions a, b, and c, the site and app clusters reach the proxy through an ALB; the proxy routes to session service instances 1..n, each backed by a ring of Redis + Dynomite nodes replicated across the regions.]
7. Connection Pooling Proxy for Site and App Cluster Nodes
• Reduces the overhead associated with establishing a new connection
• Leverages existing connections efficiently
• Constrains the total number of connections
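The proxy's three properties can be sketched as a minimal connection pool. This is an illustrative model, not Zulily's proxy: the `factory` callable and the dict-based bookkeeping are stand-ins for a real TCP connection setup.

```python
import queue

class ConnectionPool:
    """Minimal sketch of a connection pool: reuses idle connections
    and caps the total number of connections, as the per-node proxy does."""

    def __init__(self, factory, max_size):
        self._factory = factory          # callable that opens a new connection
        self._idle = queue.LifoQueue()   # idle connections ready for reuse
        self._max_size = max_size
        self._created = 0

    def acquire(self):
        try:
            # Reuse an existing connection: no new-connection overhead.
            return self._idle.get_nowait()
        except queue.Empty:
            if self._created >= self._max_size:
                # Constrain total connections: block until one is released.
                return self._idle.get()
            self._created += 1
            return self._factory()

    def release(self, conn):
        self._idle.put(conn)

# Illustration with a stand-in "connection" factory.
counter = {"opened": 0}
def fake_connect():
    counter["opened"] += 1
    return object()

pool = ConnectionPool(fake_connect, max_size=2)
c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()   # reuses c1 instead of opening a second connection
```

Because the second `acquire` finds an idle connection, only one connection is ever opened, which is the overhead reduction the slide describes.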
8. Session Service
• Request routing based on consistent hashing (Murmur hash)
• Traffic distribution based on geo location
• Topology-aware load balancing (token aware)
• Request rerouting based on failed functional or latency health checks
[Diagram: a three-node token ring; node a1 owns tokens 0-100, a2 owns 101-200, a3 owns 201-300.]
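The token-range routing above can be sketched as follows. The deck uses Murmur hash; stdlib `md5` stands in here, and the 0-300 token space and node names come straight from the slide's diagram.

```python
import bisect
import hashlib

# Token ring from the slide: each node owns a contiguous token range.
RING = [(100, "a1"), (200, "a2"), (300, "a3")]   # (upper token, node)
TOKENS = [upper for upper, _ in RING]

def token_for(session_id: str) -> int:
    # md5 stands in for Murmur hash; both give a stable integer digest.
    digest = hashlib.md5(session_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 301   # token space 0..300

def node_for(session_id: str) -> str:
    # Find the first node whose upper token bound covers this token.
    token = token_for(session_id)
    idx = bisect.bisect_left(TOKENS, token)
    return RING[idx][1]

# The same session id always lands on the same node, so the service
# layer can route any request straight to the node that owns the data.
assert node_for("session-42") == node_for("session-42")
```

A real token-aware router would also consult health checks and reroute around failed nodes, as the slide's last bullet notes.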
9. Session Service
Real-Time Replication between Redis Nodes via Dynomite
P2P and active/active approach
[Diagram: three data centers a, b, and c, each a ring of three nodes (a1-a3, b1-b3, c1-c3); an incoming write lands on one node by consistent hashing of the session id and is replicated to the peer rings.]
10. Production Rollout
• Staged rollout
• Double write (from time T1)
• Copied data offline from the slave nodes (prior to T1)
• Double read
• Data sanity checks
• Apply chaos engineering principles to the new system
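The double-write/double-read steps above can be sketched as a small wrapper. The dict-backed stores and the `mismatches` list are hypothetical stand-ins for the legacy and new session stores and for whatever sanity-check reporting was actually used.

```python
class DualWriteSessionStore:
    """Staged-rollout sketch: write every session to both stores,
    read from both and compare (the 'double read' sanity check),
    and keep serving from the legacy store until cutover."""

    def __init__(self, legacy, new, mismatches):
        self.legacy = legacy           # dict-like legacy store (source of truth)
        self.new = new                 # dict-like new store
        self.mismatches = mismatches   # keys where the two stores disagree

    def put(self, key, value):
        self.legacy[key] = value   # legacy stays authoritative
        self.new[key] = value      # double write (from time T1 onward)

    def get(self, key):
        legacy_val = self.legacy.get(key)
        new_val = self.new.get(key)
        if legacy_val != new_val:         # data sanity check
            self.mismatches.append(key)
        return legacy_val                 # serve from legacy until cutover

mismatches = []
store = DualWriteSessionStore({}, {}, mismatches)
store.put("s1", {"cart": 3})
value = store.get("s1")
```

Data written before T1 is not in the new store, which is why the offline copy from the slave nodes has to backfill everything prior to T1 before the comparison is meaningful.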
12. Drawbacks
• Scaling is horizontal only, by adding hosts
• Higher network traffic volume
• Cross-AZ/region/DC traffic costs money
• Adding hosts to the ring is a manual process
13. Summary
• Connection Pooling Proxy
• Session Service
• Redis is not only a cache; it is persistent storage
• Design for failure
• Use chaos engineering practices
• Replicate your data across multiple regions and use real-time replication
[Diagram: the new architecture repeated from slide 6.]
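The "Redis is not only a cache" point relies on Redis durability features. A minimal redis.conf sketch of the relevant settings; these are illustrative values, not the deck's actual configuration:

```
# Durability settings for using Redis as persistent storage (illustrative):
appendonly yes          # enable the append-only file (AOF)
appendfsync everysec    # fsync once per second: durability/latency trade-off
save ""                 # disable RDB snapshots when relying on the AOF
```

With the AOF enabled, a restarted node replays its log instead of coming back empty, which is what lets session data survive the host failures the talk designs for.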
Lead Engineer for the e-commerce platform team at Zulily.
I will talk about how Zulily used Redis to build a highly available, persistent session management service, and how Zulily's business model creates specific technical challenges and shapes the role of session management.
Zulily's business model is all about a discovery-driven experience: our customers come to the site and apps to discover and enjoy, like going to a mall or a boutique.
Zulily launches a new store every day, which technically means launching millions of personalized versions of the site/app daily.
That translates to specific technical challenges:
- The traffic is spiky, which means there is no time to warm the cache.
- Speed is critical.
- The customer session flow is critical for a smooth discovery experience and is read on every single request.
A reliable global session service is critical:
If it goes down, you can't serve customers, as every single request to the apps or site requires a session.
Infrastructure is volatile; we need persistence.
Speed is critical.
As engineers, the main fact we believe in is "Everything fails, all the time":
Bad code pushes
Hardware failures
Network latency
Region/AZ outages
That brings us to the reason we’re here today – to discuss how we at Zulily evolved our infrastructure to a more distributed system with the help of Redis – to create a more reliable experience.
In retail, a session service is critical – especially if your footprint is global. But – we all know this familiar quote from Werner Vogels. Failure is bound to happen – our jobs as engineers are to plan for failure – and to ensure that no matter what, we can serve the customer.
The session management service is sharded across multiple AZs: one AZ outage affects a percentage of customers, and business impact == $$$.
Typical architecture:
Client layer (apps and site clusters)
Twemproxy (played the role of proxy and connection pooling; deployed on every client machine)
Application layer: sharding logic coupled with the application layer; the session service shared resources with other application workloads
Customer sessions lived in Redis as permanent storage, with slave nodes as backups
Problems:
1. Not HA: losing hardware or a network partition leads to an outage, and network latency leads to a degraded experience.
2. Sharding is coupled, which limited scaling and the global expansion of Zulily. We want our data close to our customers.
Losing an AWS AZ caused us a major outage and a degraded experience. As session data is used for every request to the Zulily app, this was not acceptable.
Client
Replaced Twemproxy with a custom proxy, since clients no longer connect directly to Redis; the proxy acts as a TCP connection pool.
Server
Used consistent hashing, and abstracted the sharding logic and geo-location detection into a new service that scales horizontally.
Storage
Used Redis as the storage layer, distributed across multiple regions in a ring topology for consistent hashing, with Dynomite for replication across regions/data centers.
Now I will deep dive into every layer: client, server, and data.
---------------------------------
What did we need?
Highly available, geo-distributed, and scalable
Tolerates hardware/partition failures and network degradation
Seamless customer experience
Millions of requests
Connection pooling on every node in the app and site clusters:
Avoids the overhead of establishing a new TCP connection and collects metrics (service mesh / Envoy proxy)
Leverages existing connections
Constrains total open connections against the load balancer
-------------------------------------------------------------
We got rid of the master/slave approach and went P2P, using Dynomite, a Netflix open-source project, for replication across regions.
A data center definition is just a virtual grouping: regions, AZs, or even on-premises.
Read request lifecycle:
Consistent hashing by the service layer.
The service layer routes to the node in the ring that has the data (either a1, a2, or a3).
------------------------------------------
[FA] The actual graph isn't important other than to show that latency remained flat; maybe add vertical lines to show when the network was killed.