www.flipkart.com
• Started in 2007
• Current architecture from mid-2010
• Evolution of the architecture presented as: Issue → RCA → Actions → Learnings
• [1] Issue: Website is “slow”
• RCA = Root Cause Analysis
Surviving & reacting to the environment
INFANCY (2007 – MID-2010)
RCA
• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…
Fixing it
• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db
[Diagram: writes go to the MySQL master, reads go to the MySQL slave, and the master replicates to the slave]
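The routing rule on this slide can be sketched in a few lines. This is a minimal illustration, not Flipkart's code: `DbRouter`, `for_write` and `for_read` are hypothetical names, and the strings stand in for real database connections. Sending critical reads to the master is taken from the slide; the likely reason (a slave can lag behind the master) is an assumption.

```python
import itertools


class DbRouter:
    """Picks the database a statement should run on (hypothetical helper)."""

    def __init__(self, master, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self.master = master
        # Round-robin over the slaves to spread read load.
        self._slaves = itertools.cycle(slaves)

    def for_write(self):
        # All writes go to the master.
        return self.master

    def for_read(self, critical=False):
        # Critical reads go to the master (assumed reason: replication lag);
        # everything else is served by a slave.
        return self.master if critical else next(self._slaves)


router = DbRouter("master_db", ["slave_db"])
print(router.for_write())              # master_db
print(router.for_read())               # slave_db
print(router.for_read(critical=True))  # master_db
```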
Learning from it
• Scale out database reads by distributing load across systems
• Isolate database writes from reads
  – Writes are (usually) more critical
RCA
• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other backend jobs
• Urm…
Fixing it
• Analytics / reporting DB (archival_db)
  – Use MyISAM: optimized for reads
  – Additional indexes for quicker reporting
[Diagram, before: website writes go to the MySQL master and website reads go to the MySQL slave. After: the master replicates to Slave 1, which serves website reads, and to Slave 2 (the analytics MySQL), which serves analytics reads]
Learning from it
• Isolate the databases serving website traffic from those used for analytics / reporting
• Isolate the systems used by the production website from those used for background processing
RCA - 2
• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
    • Creating fresh connection for each call
    • All the calls are made in serial order
[Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]
RCA - 3
• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving each request
[Diagram: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response]
RCA – 1,2,3
• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!
Fixing it
• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates to it through local socket
  – Hosts pluggable “handlers”
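The slides do not say what the handlers do or how the socket protocol looks, so the sketch below is only an illustration of the general idea: a long-lived process can keep state between requests, which a per-request PHP process cannot. It is written in Python rather than Java for brevity, the socket transport is left out, and every name in it (`Agent`, `ConfigHandler`, `load_config`) is hypothetical. The config-caching handler is an assumption suggested by RCA - 3, not something the deck confirms.

```python
class Agent:
    """Long-lived dispatcher that routes named requests to pluggable handlers."""

    def __init__(self):
        self._handlers = {}

    def register(self, name, handler):
        # A handler is any callable taking one payload argument.
        self._handlers[name] = handler

    def handle(self, name, payload):
        handler = self._handlers.get(name)
        if handler is None:
            raise KeyError(f"no handler registered for {name!r}")
        return handler(payload)


class ConfigHandler:
    """Assumed example handler: loads each config key once, then serves it from memory."""

    def __init__(self, load_config):
        self._load_config = load_config  # stand-in for a database lookup
        self._cache = {}

    def __call__(self, key):
        if key not in self._cache:
            self._cache[key] = self._load_config(key)
        return self._cache[key]


agent = Agent()
agent.register("config", ConfigHandler(lambda key: f"value-of-{key}"))
print(agent.handle("config", "Config1"))  # value-of-Config1
```

Because the agent outlives individual requests, the second request for the same key does not touch the database stand-in at all.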
RCA
• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU
Fixing it
• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity
Caching: Complications (1)
• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
  – Example: Logged-in user’s name
• Impossible to do effective bucket testing
  – Or at least makes it prohibitively complex
Caching: Complications (2)
• “Caching PHP serialized Product objects”
• Without caching: getProductInfo() → Fetch from CMS
• With caching, cache hit: getProductInfo() → Fetch from Cache
• With caching, cache miss: getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
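The three paths above amount to the usual cache-aside lookup. A minimal sketch, assuming a plain dict in place of Memcache and a hypothetical `fetch_from_cms` callable in place of the real CMS client:

```python
def get_product_info(product_id, cache, fetch_from_cms):
    """Return product info, consulting the cache before the CMS."""
    product = cache.get(product_id)
    if product is not None:
        return product                    # cache hit: Fetch from Cache
    product = fetch_from_cms(product_id)  # cache miss: Fetch from CMS
    cache[product_id] = product           # Set in Cache for the next caller
    return product


cms_calls = []

def fetch_from_cms(product_id):
    # Stand-in for the real CMS; records each call so misses are visible.
    cms_calls.append(product_id)
    return {"id": product_id, "title": "Example product"}

cache = {}
get_product_info("p1", cache, fetch_from_cms)  # miss: goes to the CMS
get_product_info("p1", cache, fetch_from_cms)  # hit: served from the cache
print(len(cms_calls))  # 1
```

The miss path is where the complexity comes from: every caller now carries cache-population logic in addition to the fetch itself.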
Caching: Complications (3)
• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
    • Notification Server: pushes notifications raised by CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
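With an infinite TTL, entries are never evicted for age, so freshness depends entirely on the update notifications. A minimal sketch of the repopulation step, assuming a dict as the cache and a hypothetical `on_product_updated` callback invoked once per notification (the real notification format is not given in the slides):

```python
def on_product_updated(product_id, cache, fetch_from_cms):
    """Handle one update notification by overwriting the cached entry."""
    cache[product_id] = fetch_from_cms(product_id)


# Stand-in CMS and cache, both plain dicts for illustration.
cms = {"p1": {"title": "Old title"}}
cache = {"p1": dict(cms["p1"])}

cms["p1"] = {"title": "New title"}               # product edited in the CMS
on_product_updated("p1", cache, cms.__getitem__)  # notification arrives
print(cache["p1"]["title"])  # New title
```

Readers never take the miss path for a product that has been populated once, which is why the slide pairs this approach with a persistent cache: losing the cache would mean losing the only copy that readers consult.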
Learning from it
• Caching is a powerful tool for performance optimization
• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation
• Use caching to go from “acceptable performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable performance”
RCA
• Why?
  – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
• But why is rest of website slow?
  – Requests to the slow service are blocking processing threads
• Eh?!
Let’s do some math
• Let’s say
  – Mean (or median) response time: 100 ms
  – 8-core server
  – All requests are CPU bound
• Throughput: 80 requests per second (rps), i.e. 8 cores × 10 requests per core per second
• Let’s also say
  – 95th percentile response time: 1000 ms
    • Call them “bad requests”
• 4 bad requests in a second
  – Throughput down to 44 rps (4 cores are tied up for the full second; the other 4 serve 40 normal requests)
• 8 bad requests in a second?
  – Throughput down to 8 rps (every core is tied up by one bad request)
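The arithmetic on this slide can be checked with a small model. It assumes, as the slide does, that each request occupies one core for its whole duration, so a server has `cores × 1000` core-milliseconds to spend per second. The function name and parameters are illustrative only.

```python
def throughput_rps(cores, normal_ms, bad_ms, bad_per_second):
    """Requests completed per second when some requests are slow.

    Each request is assumed to hold one core for its full duration.
    """
    budget_ms = cores * 1000                 # core-milliseconds per second
    bad_cost_ms = bad_per_second * bad_ms    # time consumed by bad requests
    if bad_cost_ms > budget_ms:
        raise ValueError("bad requests alone exceed the server's capacity")
    normal = (budget_ms - bad_cost_ms) // normal_ms
    return normal + bad_per_second


print(throughput_rps(8, 100, 1000, 0))  # 80
print(throughput_rps(8, 100, 1000, 4))  # 44
print(throughput_rps(8, 100, 1000, 8))  # 8
```

Note that 4 bad requests out of 80 is exactly 5%, so the 44 rps case is what a 95th percentile of 1000 ms implies at full load, not a rare worst case.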
Fixing it
• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
    • only to pages that depend on it
• Very aggressive timeouts for non-critical services
  – Example: Recommendations
    • On a Product page, Search results page etc.
    • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
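A timeout with a fallback for a non-critical service might look like the sketch below. It is an illustration under assumptions: `call_with_timeout`, `slow_recommendations` and `render_product_page` are hypothetical names, and the timeout values are arbitrary. The timeout here only frees the caller; the worker thread keeps running until the call returns, so a real client should also set a timeout on the underlying connection.

```python
import concurrent.futures
import time

# Shared pool for outgoing service calls.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(service_call, timeout_s, fallback):
    """Run service_call, but give up and return fallback after timeout_s."""
    future = _pool.submit(service_call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        return fallback


def slow_recommendations():
    # Stand-in for a slow Recommendations service.
    time.sleep(0.3)
    return ["book-1", "book-2"]


def render_product_page():
    # Recommendations are good-to-have here, so the timeout is very aggressive
    # and the page is rendered without them when the service is slow.
    recs = call_with_timeout(slow_recommendations, timeout_s=0.05, fallback=[])
    return {"product": "core product details", "recommendations": recs}


print(render_product_page())
```

On a page where recommendations are the main content, the same call would use a longer timeout and surface the failure instead of hiding it.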
Learning from it
• Isolate the impact of poorly performing services / systems
• Isolate the required from the good-to-have
RCA
• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
    • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product information updates
Fixing it
• Separate cluster for receiving product info update notifications from the cluster that serves users
• Admission control: Don’t let a system receive more requests than it can handle
  – Throttling
• Batch the notifications
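Throttling and batching can each be sketched briefly. The slides do not say which throttling scheme was used; the token bucket below is one common choice, swapped in here for illustration. `TokenBucket`, `try_admit` and `batches` are hypothetical names, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Admits at most rate_per_s requests per second, with bursts up to burst."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_admit(self):
        # Refill according to elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # admit the request
        return False      # reject: the caller should back off or retry later


def batches(items, size):
    """Split a list of notifications into chunks of at most size items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


bucket = TokenBucket(rate_per_s=100, burst=10)
admitted = sum(bucket.try_admit() for _ in range(50))
print(admitted)  # roughly the burst size, since the loop runs almost instantly
print(batches(["p1", "p2", "p3", "p4", "p5"], 2))
```

Batching changes the granularity of the work: one request carrying many product IDs costs far less per product than one request per ID.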
Learning from it
• Isolate the systems serving internal requests from those serving production traffic
• Admission control to ensure that a system is isolated from the over-enthusiasm of a client
• Look at the granularity at which we’re working
Mistake?
• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect