Shared Personalization Service - How To Scale to 15K RPS, Patrice Pelland
1. SPS – Scale to 15k RPS
Patrice Pelland
Microsoft
2. Overview and Goals of SPS
• SPS (Shared Personalization Service)
• It is a backend storage and service
• Enables following scenarios:
• Explicit personalization
• Implicit content optimization
• Geo based customization
3. Scenario #1
Scenario #1 – WL Anonymous ID and Machine Anonymous ID-based Explicit Personalization
Examples: Locations for weather, news, events, favorite sports team, personal shopping list, customized page settings, etc.
4.
5. Scenario #2
Scenario #2 – WL Anonymous ID and Machine Anonymous ID-based Implicit Content Optimization
Examples: User demographic & behavior based content optimizations and/or personalization (e.g. personal recommendation)
6.
7. Scenario #3
Scenario #3 – Geo-based customization
SPS provides a Geolookup service that allows partners to enable IP-based customizations (e.g. default location, location-based content, geofencing, etc.)
8.
9. Scaling? Availability? Perf?
• Why? 150 million users visit the US Home Page per month, with peaks of 15,000 RPS, and up to 75 million users on other home pages.
• Latency goals: Read < 25 ms – update < 50 ms
• Pages have to be up - $$$ loss if not
• Need to be stateless
10. Overall Architecture
[Architecture diagram. Components shown: Partner web server; CMS Rendering System; SPSAdapter (SPS MSN CMS service wrapper); Load Balancer; SPS FE Cluster (WCF Service, SPS Logic, Cache Access, Webstore DB Access, Database Access, Geo Service); AppFabric Cache Cluster (several Cache Boxes holding Cached Data); Lookup System; Database Partitions; SPS Webstore Config Server (SPS Configuration, SPS Deployment, DataLookup, Deployment Data).]
11. How?
• Everything is Stateless
• Windows AppFabric Caching service with many nodes – reliable and redundant
– Similar to memcached
– 240 GB of memory cache in the US
• SQL Server DB partitioning with a lookup system; master/backup at each level
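A minimal sketch of how a lookup system of this kind might map a user to a database partition. The class and method names (`PartitionLookup`, `find`, `create`) and the least-loaded placement policy are assumptions for illustration; the slides do not describe how SPS actually assigns users to partitions, and an in-memory dictionary stands in for the lookup database.

```python
class PartitionLookup:
    """Hypothetical user-to-partition directory (in-memory stand-in for a lookup DB)."""

    def __init__(self, partition_count):
        self.partition_count = partition_count
        self._assignments = {}              # user_id -> partition index
        self._load = [0] * partition_count  # number of users placed on each partition

    def find(self, user_id):
        """Return the user's partition index, or None if no lookup record exists."""
        return self._assignments.get(user_id)

    def create(self, user_id):
        """Create a lookup record, placing a new user on the least-loaded partition."""
        if user_id in self._assignments:
            return self._assignments[user_id]
        # Assumed policy: pick the partition with the fewest users (first one wins ties).
        partition = min(range(self.partition_count), key=self._load.__getitem__)
        self._assignments[user_id] = partition
        self._load[partition] += 1
        return partition
```

One plausible benefit of a directory over hashing the user ID is that a new partition can be added without moving users who already have a lookup record.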
12. Facts
• Availability
– Designed with no single point of failure
– Web - multiple web servers behind a LB.
– DB
• Each DB partition has a primary & secondary DB setup with a multi-master topology.
• Transactional replication is used by SQL to sync the primary & secondary. If a primary DB server goes down, requests are handled by the secondary DB server.
– File share: WAN Sync is used to replicate critical files across the primary & secondary file servers. A VIP ensures automatic read availability for the SPS service when the primary goes down. Write availability for backend services is ensured by manual failover.
– Throttling to prevent outages from abnormal traffic – throttling is configurable at both the server level and the partner level. Partner-level throttling is set at around 200% of normal peak traffic.
– Load balancer also has a secondary backup
• Scalability
– Web & AppFabric cache: Scalability is achieved by adding new nodes. Everything is stateless…
– DB: Databases are hosted as a webstore application. Scalability is achieved by partitioning. Adding a data partition is very easy.
• Live site metrics
– Latency: 10 ms read, 30 ms update, 12 ms (async update)
– US: 39 web servers, 15 AppFabric caching servers, 10 SQL lookup servers and 12 SQL backend (data) servers
– Asia: 17 web servers, 8 AppFabric caching servers, 8 SQL lookup servers and 10 SQL backend (data) servers
– Europe: 16 web servers, 8 AppFabric caching servers, 8 SQL lookup servers and 10 SQL backend (data) servers
– Current peak RPS per web box in the US is 375 (14.7K RPS US), peak CPU 40%. Server capacity is around 600 RPS at 70% CPU
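The US figures above can be cross-checked with a little arithmetic. This is only a back-of-the-envelope sketch: the constants come from the slide, while the variable names and the assumption that load is spread evenly across web boxes are mine.

```python
import math

US_WEB_SERVERS = 39
PEAK_RPS_PER_BOX = 375        # current peak per web box (about 40% CPU)
CAPACITY_RPS_PER_BOX = 600    # approximate capacity per web box (about 70% CPU)

# Fleet-wide peak: 39 * 375 = 14,625 RPS, which the slide rounds to 14.7K.
peak_rps = US_WEB_SERVERS * PEAK_RPS_PER_BOX

# Fleet-wide capacity if every box ran at its rated 600 RPS.
capacity_rps = US_WEB_SERVERS * CAPACITY_RPS_PER_BOX

# Headroom factor between rated capacity and current peak.
headroom = capacity_rps / peak_rps

# Fewest boxes that could carry the current peak at rated capacity,
# assuming perfectly even load balancing.
min_servers = math.ceil(peak_rps / CAPACITY_RPS_PER_BOX)
```

Under these assumptions the US web tier runs at about 1.6x headroom over its current peak, and roughly 25 of the 39 boxes would be enough to carry that peak at rated capacity, which leaves room for server failures and traffic spikes.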
13. High-level Features
• Support shared namespace definition – reduce # of calls
• Support multiple levels of access control of shared namespace
– Behind corp firewall
• Plug-in smart defaults for namespace
– Smart Defaults return faster for cases where the user doesn’t have customizations yet.
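A minimal sketch of what a per-namespace smart-defaults plug-in could look like. Everything here is hypothetical: the registry, the function names (`register_defaults`, `resolve`), the `weather` namespace and its attributes are illustrative and not taken from the SPS API.

```python
from typing import Callable, Dict

# A provider takes geo information and returns default attribute values
# for one namespace.
DefaultsProvider = Callable[[dict], dict]

_providers: Dict[str, DefaultsProvider] = {}


def register_defaults(namespace: str, provider: DefaultsProvider) -> None:
    """Plug a defaults provider in for a namespace."""
    _providers[namespace] = provider


def resolve(namespace: str, stored: dict, geo_info: dict) -> dict:
    """Merge stored user values over the namespace defaults.

    Stored values always win; defaults only fill in attributes the user
    has not customized yet.
    """
    provider = _providers.get(namespace)
    defaults = provider(geo_info) if provider else {}
    return {**defaults, **stored}


# Hypothetical provider: default the weather location to the user's city.
register_defaults(
    "weather",
    lambda geo: {"location": geo.get("city", "unknown"), "units": "metric"},
)
```

For a user with no stored data, `resolve("weather", {}, {"city": "Seattle"})` returns only the defaults, so no per-user record is needed to produce a sensible response.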
14. High-level Features
• Plug-in smart data validation for namespace
– Small pre-compiled validation DLLs run on the server
• Bulk upload of implicit user preference or clustering info
• Geolookup service – One stop shop – reduce calls
• Support both netTCP calls and WCF calls – if in the same DC, netTCP is 35% faster than normal TCP
• Service is available globally: US, Europe and Asia – closer to the user.
15. High-level Features
• Introduction of an API for Async update
– Designed to support implicit updates or storing session data. In this case, the user does not explicitly make an effort to update his/her settings. Instead, by just browsing a page or clicking a link, the corresponding settings are stored on SPS.
• Examples: Recent stock list from doing stock quotes on the MSN Money site, Search History, Article List where the user clicked thumbs up/down, etc.
– Two-stage updates: 1) data from the client request is first saved in cache; then 2) updates are written to the DB in batches, thus allowing faster response time to the client. Optimized for writes.
– Data is in memory for a short period of time before being written to the DB. We are using AppFabric high availability mode (i.e. dual cache copy) to minimize potential data loss. Data loss may occur only if both cache servers are down at the same time.
– Async update can be turned on/off at the attribute level via the admin UI. E.g. the user’s preferred locations are not using Async update, but Money Recent Quotes may be.
16. Anatomy of a Get API Call
[Diagram. Components: SPS Endpoint (WCF, CF), AppFabric Cache, Partition Lookup, Core (multiple instances), MSN Geo Lookup Service, Smart Defaults Providers. Numbered flow:]
(0) Partner Request
(1) Lookup Data in Cache
(2) Return Data Found in Cache
(3) UserId for lookup (Cache miss)
(4) Core Partition Information for User Record
(5) Query for records
(6) User Records
(7) User IP
(8) User Location/Connection Info
(9) User RevIPInfo and Data missing from DB
(10) Defaults for Missing Data
(11) Write to Cache
(12) Response
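The numbered Get flow above reads like a cache-aside lookup with a partition directory and a defaults fallback. The sketch below follows the step numbers under stated assumptions: plain dictionaries stand in for the AppFabric cache, the partition lookup and the Core databases; `geo` and `defaults_for` are hypothetical callables; and the choice to call the geo service only on a cache miss is inferred from the step order, not stated in the slides.

```python
def get_user_data(user_id, ip, cache, lookup, partitions, geo, defaults_for):
    """Hypothetical Get handler; comments refer to the diagram's step numbers."""
    # (1)-(2) Look in the cache first and return immediately on a hit.
    cached = cache.get(user_id)
    if cached is not None:
        return cached

    # (3)-(4) Cache miss: ask the partition lookup where the user's record lives.
    partition = lookup.get(user_id)

    # (5)-(6) Query that partition for the user's records (empty if unknown user).
    if partition is not None:
        records = dict(partitions[partition].get(user_id, {}))
    else:
        records = {}

    # (7)-(8) Resolve the caller's IP into location/connection info.
    geo_info = geo(ip)

    # (9)-(10) Fill in defaults only for attributes missing from the DB.
    for key, value in defaults_for(geo_info).items():
        records.setdefault(key, value)

    # (11) Write the assembled result to the cache for the next request.
    cache[user_id] = records

    # (12) Respond to the partner.
    return records
```

Because the merged result is cached at step (11), a second request for the same user is served at step (2) without touching the lookup, the database or the geo service.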
17. Anatomy of an Update API Call
[Diagram. Components: SPS Endpoint (WCF, CF), AppFabric Cache, Partition Lookup, Core (multiple instances), Smart Defaults Providers, Smart Validator Provider. Numbered flow:]
(0) Partner Request
(1) Validate Request
(2) Success/Fail
(3) UserId for lookup
(4) User Not Found
(5) Create lookup record
(6) Core Partition Information for User Record
(7) Invalidate Cache
(8) Write records
(9) Success/Fail
(10) Response
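The Update flow above can be sketched in the same style. This is an illustration under assumptions, not the SPS code: dictionaries stand in for the cache, lookup and Core databases, `validate` is a hypothetical validator callable, and the round-robin placement of new users is made up for the example.

```python
def update_user_data(user_id, changes, cache, lookup, partitions, validate):
    """Hypothetical Update handler; comments refer to the diagram's step numbers."""
    # (1)-(2) Run the namespace validator; reject the request on failure.
    if not validate(changes):
        return False

    # (3) Ask the partition lookup for the user's partition.
    partition = lookup.get(user_id)

    # (4)-(5) User not found: create a lookup record.
    if partition is None:
        # Assumed placement policy (round robin by number of known users).
        partition = len(lookup) % len(partitions)
        lookup[user_id] = partition

    # (6) The partition for the user's record is now known.
    # (7) Invalidate the cached copy so the next Get reloads from the DB.
    cache.pop(user_id, None)

    # (8)-(9) Write the records to the user's partition.
    partitions[partition].setdefault(user_id, {}).update(changes)

    # (10) Report success to the partner.
    return True
```

Invalidating rather than updating the cache entry keeps the write path simple: the next Get call rebuilds the entry, including any defaults, from the database.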
18. Anatomy of an Async Write Call
[Diagram. Components: SPS Endpoint (WCF, CF), Main Cache, Async Cache, Core (multiple instances), CacheSweeper.]
Request flow:
1. Async Write Request
2. Invalidate Main cache
3. Write to Async cache
4. Return (success)
5. Response
CacheSweeper flow:
a. Batch Read for DB Loading
b. DB Load
c. Invalidate Async cache
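The two flows above (request steps 1 to 5, sweeper steps a to c) amount to a write-behind cache. Here is a minimal single-threaded sketch; the class name `AsyncWriter`, its methods, and the use of dictionaries for both caches and the database are assumptions for illustration. A real sweeper would also have to handle writes that arrive while a sweep is in progress, which this sketch ignores.

```python
class AsyncWriter:
    """Hypothetical write-behind path: fast cache write now, batched DB load later."""

    def __init__(self, db):
        self.main_cache = {}    # stand-in for the Main Cache
        self.async_cache = {}   # stand-in for the Async Cache (pending writes)
        self.db = db            # stand-in for the Core databases

    def async_write(self, user_id, changes):
        # 2. Invalidate the Main cache entry so stale data is not served.
        self.main_cache.pop(user_id, None)
        # 3. Record the pending change in the Async cache.
        self.async_cache.setdefault(user_id, {}).update(changes)
        # 4.-5. Return success without waiting for the database.
        return True

    def sweep(self):
        """One CacheSweeper pass; returns the number of users flushed."""
        # a. Batch-read everything that is waiting to be loaded.
        batch = dict(self.async_cache)
        # b. Load the batch into the database.
        for user_id, changes in batch.items():
            self.db.setdefault(user_id, {}).update(changes)
        # c. Invalidate the flushed entries in the Async cache.
        for user_id in batch:
            self.async_cache.pop(user_id, None)
        return len(batch)
```

The client only waits for the two cache operations, while the database cost is paid later and amortized over a batch.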
19. Anatomy of an Async Read Call
[Diagram. Components: SPS Endpoint (WCF, CF), Main Cache, Async Cache, Partition Lookup, Core (multiple instances). Numbered flow:]
1. Read Request
2. Read from Async cache
3. Cache miss from Async cache
4. Read from Main Cache
5. Cache miss from Main cache
6. UserId for lookup (Cache miss)
7. Core Partition Information for User Record
8. Query for records
9. User Records
10. Write to Main Cache
11. Response
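The read order above (Async cache, then Main cache, then database) can be sketched as follows. Dictionaries again stand in for the caches, the partition lookup and the Core databases, and the function name is hypothetical. The sketch also assumes an Async cache entry holds the complete value being read, which the slides do not confirm.

```python
def async_read(user_id, async_cache, main_cache, lookup, partitions):
    """Hypothetical read path for async-enabled data; comments follow the step numbers."""
    # 2. Check the Async cache first: it holds writes not yet in the database.
    pending = async_cache.get(user_id)
    if pending is not None:
        return pending

    # 3.-4. Async cache miss: fall back to the Main cache.
    cached = main_cache.get(user_id)
    if cached is not None:
        return cached

    # 5.-7. Main cache miss: find the user's partition.
    partition = lookup.get(user_id)

    # 8.-9. Query the partition for the user's records (empty if unknown user).
    if partition is not None:
        records = dict(partitions[partition].get(user_id, {}))
    else:
        records = {}

    # 10. Populate the Main cache for later reads.
    main_cache[user_id] = records

    # 11. Respond.
    return records
```

Checking the Async cache before anything else is what lets a reader see a just-written value even though the database has not been updated yet.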
SPS stands for Shared Personalization Service.
SPS is a service created by MSN to stop the proliferation of profiles. It is used by many teams at Microsoft, mostly in MSN.
It provides backend storage of user customizations and optimization keys, plus a service backbone that offers different entry points to the data.
Anonymous: there is no way to identify the person from this ID.