Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy


Published in: Technology
  • “This has basically given us lots of opportunities to make mistakes. And make mistakes we did.”

    1. Flipkart Website Architecture: Mistakes & Learnings. Siddhartha Reddy, Architect, Flipkart
    2. June 2007
    3. November 2007
    4. December 2012
    5. • Started in 2007
       • Current architecture from mid-2010
       • Evolution of the architecture presented as: Issue [1], RCA [2], Actions, Learnings
       • [1] Issue: Website is “slow”
       • [2] RCA = Root Cause Analysis
    6. INFANCY (2007 – mid-2010): Surviving & reacting to the environment
    7. Website is “slow”!
    8. RCA
       • Why?
         – MySQL queries taking too long
       • Why?
         – Too many queries
         – Many slow queries
         – Queries locking tables
       • Why?
         – Capacity
       • Hmm…
    9. Fixing it
       • Get beefier servers (the obvious)
       • Separate master_db, slave_db
         – Writes go to master_db
         – Reads from slave_db
         – Critical reads from master_db
       [Diagram: a single MySQL server taking both writes and reads, next to a MySQL master (writes) replicating to a MySQL slave (reads)]
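The read/write split on this slide can be sketched as a small router. This is only an illustration under stated assumptions: `FakeDb` and `DbRouter` are hypothetical names, the statement classification is deliberately naive, and the talk does not show how Flipkart's code actually chose a connection.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDb:
    """Stand-in for a MySQL connection; records the statements it receives."""
    name: str
    executed: list = field(default_factory=list)

    def execute(self, sql: str) -> str:
        self.executed.append(sql)
        return self.name  # return which database handled the statement


class DbRouter:
    """Routes writes and critical reads to master_db, other reads to slave_db."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master_db: FakeDb, slave_db: FakeDb):
        self.master_db = master_db
        self.slave_db = slave_db

    def execute(self, sql: str, critical: bool = False) -> str:
        is_write = sql.lstrip().upper().startswith(self.WRITE_VERBS)
        # Replication is asynchronous, so the slave can lag behind; reads that
        # must see the latest write (critical=True) also go to the master.
        target = self.master_db if (is_write or critical) else self.slave_db
        return target.execute(sql)


master, slave = FakeDb("master_db"), FakeDb("slave_db")
router = DbRouter(master, slave)

print(router.execute("INSERT INTO orders VALUES (1)"))
print(router.execute("SELECT * FROM products"))
print(router.execute("SELECT * FROM orders WHERE id = 1", critical=True))
```

The `critical` flag is left to the caller because only the caller knows whether a stale read is acceptable.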
    10. Learning from it
        • Scale out database reads by distributing load across systems
        • Isolate database writes from reads
          – Writes are (usually) more critical
    11. Website is “slow”! (Again)
    12. RCA
        • Why?
          – MySQL queries taking too long (on slave_db)
        • Why?
          – Too many queries
          – Many slow queries
        • Why?
          – Queries from analytics / reporting and other backend jobs
        • Urm…
    13. Fixing it
        • Analytics / reporting DB (archival_db)
          – Use MyISAM: optimized for reads
          – Additional indexes for quicker reporting
        [Diagram: website writes go to the MySQL master; website reads go to MySQL Slave 1 and analytics reads go to a separate MySQL Slave 2, both fed by replication from the master]
    14. Learning from it
        • Isolate the databases serving website traffic from those used for analytics / reporting
        • Isolate the systems used by the production website from those used for background processing
    15. BABY (2010 – 2011): Learning the basics
    16. Website is “slow”!
    17. RCA
        • Why?
        • How?
          – Instrumentation
    18. RCA - 1
        • Why?
          – Logging a lot
          – PHP processes blocking on writing logs
        [Diagram: Request1 -> Process1, Request2 -> Process2, Request3 -> Process3; one process is writing to the shared log file while the other two wait]
    19. RCA - 2
        • Why?
          – Service Oriented Architecture (SOA)
          – Too many calls to remote services per request
            • Creating fresh connection for each call
            • All the calls are made in serial order
        [Diagram: Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response]
    20. RCA - 3
        • Why?
          – Configurability
          – Fetch a lot of “config” from database for serving each request
        [Diagram: Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response]
    21. RCA – 1, 2, 3
        • Why?
          – Logging a lot
          – SOA
          – Configurability
        • Why?
          – PHP’s process model
        • Argh!
    22. Fixing it
        • fk-w3-agent
          – Simple Java “middleware” daemon
          – Deployed on each web server
          – PHP communicates with it through a local socket
          – Hosts pluggable “handlers”
    23. fk-w3-agent: LoggingHandler
        [Diagram: before, Process1, Process2 and Process3 each write to the log file directly; after, they hand log entries to fk-w3-agent, which writes the log file asynchronously with buffering]
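The async / buffered idea behind LoggingHandler can be illustrated inside a single process with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. This is an analogy, not the fk-w3-agent design: the real agent is a separate Java daemon reached over a local socket, and `ListHandler` and the logger name here are hypothetical stand-ins.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in for the slow log-file writer; collects messages in memory."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())


log_queue = queue.Queue()                # the in-memory buffer
slow_sink = ListHandler()                # would be a file handler in practice
listener = logging.handlers.QueueListener(log_queue, slow_sink)
listener.start()                         # background thread drains the buffer

logger = logging.getLogger("w3_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False                 # keep output out of the root logger
logger.addHandler(logging.handlers.QueueHandler(log_queue))

for request_id in (1, 2, 3):
    # Enqueueing returns immediately; the request never waits on the sink.
    logger.info("handled request %d", request_id)

listener.stop()                          # processes whatever is still queued
print(slow_sink.lines)
```

The request path only pays for an in-memory enqueue; contention on the log file is confined to the single draining thread.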
    24. fk-w3-agent: ServiceHandler(s)
        [Diagram: before, Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response; after, Receive request -> Call fk-w3-agent -> Send response, with fk-w3-agent talking to Service1 and Service2]
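A rough sketch of why moving service calls out of the serial request path helps. It models only the "calls made in serial order" half of the problem; connection reuse is not modelled. `call_service`, the service latencies and the pool are all assumptions made for illustration, with `time.sleep` standing in for network and service time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_service(name: str, latency_s: float) -> str:
    """Hypothetical remote call; sleep stands in for network + service time."""
    time.sleep(latency_s)
    return f"{name}-response"


SERVICES = {"Service1": 0.2, "Service2": 0.2}  # assumed latencies in seconds


def fetch_serial() -> dict:
    # One call after another: total time is the sum of the latencies.
    return {name: call_service(name, lat) for name, lat in SERVICES.items()}


# A long-lived pool, loosely analogous to a daemon that outlives each request.
pool = ThreadPoolExecutor(max_workers=len(SERVICES))


def fetch_parallel() -> dict:
    # Submit everything first, then wait: total time is the slowest call.
    futures = {name: pool.submit(call_service, name, lat)
               for name, lat in SERVICES.items()}
    return {name: fut.result() for name, fut in futures.items()}


def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


serial_result, serial_s = timed(fetch_serial)
parallel_result, parallel_s = timed(fetch_parallel)
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```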
    25. fk-w3-agent: ConfigHandler
        [Diagram: before, Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 (each from the database) -> Send response; after, Receive request -> Fetch all config from fk-w3-agent -> Send response, with fk-w3-agent polling the database and caching the config]
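The "poll and cache" pattern from this slide, as a minimal sketch. `PollingConfigCache`, `FakeConfigDb` and the 60-second interval are hypothetical; the talk does not state the agent's polling interval or storage format.

```python
import threading


class FakeConfigDb:
    """Stand-in for the config tables; counts how often it is queried."""

    def __init__(self, rows: dict):
        self.rows = rows
        self.reads = 0

    def load_all(self) -> dict:
        self.reads += 1
        return dict(self.rows)


class PollingConfigCache:
    """Keeps a local snapshot of all config, refreshed by polling the source."""

    def __init__(self, load_all, poll_interval_s: float):
        self._load_all = load_all
        self._poll_interval_s = poll_interval_s
        self._lock = threading.Lock()
        self._snapshot = load_all()          # initial load at start-up
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def refresh(self):
        fresh = self._load_all()
        with self._lock:
            self._snapshot = fresh

    def _poll_loop(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._poll_interval_s):
            self.refresh()

    def get_all(self) -> dict:
        with self._lock:
            return dict(self._snapshot)


db = FakeConfigDb({"Config1": "a", "Config2": "b", "Config3": "c", "Config4": "d"})
cache = PollingConfigCache(db.load_all, poll_interval_s=60.0)

for _ in range(1000):                        # 1000 "requests"
    config = cache.get_all()
print(db.reads)                              # database was queried once, at start-up
```

In a long-running daemon `start()` would be called once so that `refresh()` runs in the background; it is left out of the demo to keep the run deterministic. The trade-off is that a config change can take up to one polling interval to become visible.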
    26. Learning from it
        • PHP: good for frontend and templating
          – Gives a lot of agility
          – Limiting process model
            • Hurdle for high performance
        • Java: stability and performance
        • Horses for courses
    27. Website is “slow”! (Again)
    28. RCA
        • Why?
          – PHP processes taking up too much time
          – PHP processes taking up too much CPU
        • Why?
          – Product info deserialization taking up time/CPU
          – View construction taking up time/CPU
    29. Fixing it
        • Caching!
        • Cache fully constructed pages
          – For a few minutes
          – Only for highly trafficked pages (Homepage)
        • Cache PHP serialized Product objects
          – ~20 million objects
          – Memcache
        • Yeah! But…
          – Add caching => add complexity
    30. Caching: Complications (1)
        • “Caching fully constructed pages”
        • But parts of pages still need to be dynamic
          – Example: Logged-in user’s name
        • Impossible to do effective bucket testing
          – Or at least makes it prohibitively complex
    31. Caching: Complications (2)
        • “Caching PHP serialized Product objects”
        • Without caching: getProductInfo() -> Fetch from CMS
        • With caching, cache hit: getProductInfo() -> Fetch from Cache
        • With caching, cache miss: getProductInfo() -> Fetch from Cache -> Fetch from CMS -> Set in Cache
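The three getProductInfo() paths on this slide correspond to what is commonly called the cache-aside pattern. A minimal sketch, assuming a plain dict in place of Memcache and a hypothetical `FakeCms` in place of the real CMS client:

```python
class FakeCms:
    """Stand-in for the CMS; counts how often it is asked for a product."""

    def __init__(self, products: dict):
        self.products = products
        self.calls = 0

    def fetch(self, product_id: str) -> dict:
        self.calls += 1
        return self.products[product_id]


def get_product_info(product_id: str, cache: dict, cms: FakeCms) -> dict:
    info = cache.get(product_id)       # Fetch from Cache
    if info is not None:
        return info                    # cache hit
    info = cms.fetch(product_id)       # cache miss: Fetch from CMS
    cache[product_id] = info           # Set in Cache for the next caller
    return info


cms = FakeCms({"p1": {"title": "Example product"}})
cache = {}

first = get_product_info("p1", cache, cms)    # miss, goes to the CMS
second = get_product_info("p1", cache, cms)   # hit, served from the cache
print(cms.calls)
```

Every caller now carries the miss-handling logic, which is one of the complexities the slide points at.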
    32. Caching: Complications (3)
        • TTL: ∞ (i.e. no invalidation)
        • Pro-actively repopulate products in the cache
          – Receive “notifications” about product updates
            • Notification Server: pushes notifications raised by CMS
        • Use a persistent, distributed cache
          – Memcache => Membase, Couchbase
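One way to picture "no TTL, repopulate on notification" is a cache whose entries are only ever overwritten by an update handler. The names below (`ProductCache`, `on_product_updated`) are hypothetical, and a dict stands in for both the CMS and the distributed cache; the Notification Server's actual message format is not described in the talk.

```python
class ProductCache:
    """Cache with no expiry: entries change only when a notification arrives."""

    def __init__(self, cms_fetch):
        self._entries = {}             # no TTL, nothing is ever evicted here
        self._cms_fetch = cms_fetch    # callable: product_id -> product info

    def get(self, product_id: str):
        return self._entries.get(product_id)

    def on_product_updated(self, product_id: str):
        """Handler for an update notification: re-fetch and overwrite."""
        self._entries[product_id] = self._cms_fetch(product_id)


cms = {"p1": {"price": 100}}
cache = ProductCache(lambda product_id: dict(cms[product_id]))

cache.on_product_updated("p1")         # initial population
cms["p1"] = {"price": 90}              # product changes in the CMS
stale = cache.get("p1")                # still the old value: no notification yet
cache.on_product_updated("p1")         # notification pushed for p1
fresh = cache.get("p1")
print(stale, fresh)
```

The window between the CMS change and the notification is where readers see stale data, which is why delivery of notifications becomes part of correctness.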
    33. Learning from it
        • Caching is a powerful tool for performance optimization
        • Caching adds complexities
          – Reduced by keeping cache close to data source
          – Think deeply about TTL, invalidation
        • Use caching to go from “acceptable performance” to “awesome performance”
          – Don’t rely on it to get to “acceptable performance”
    34. KID (2012): Growing up
    35. Website is “slow”!
    36. RCA
        • Why?
          – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow)
        • But why is the rest of the website slow?
          – Requests to the slow service are blocking processing threads
        • Eh?!
    37. Let’s do some math
        • Let’s say
          – Mean (or median) response time: 100 ms
          – 8-core server
          – All requests are CPU bound
        • Throughput: 80 requests per second (rps)
        • Let’s also say
          – 95th percentile response time: 1000 ms
            • Call them “bad requests”
        • 4 bad requests in a second
          – Throughput down to 44 rps
        • 8 bad requests in a second?
          – Throughput down to 8 rps
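The arithmetic on this slide can be reproduced with a small function. It assumes the simplest possible model, in which each request occupies one core for its whole duration and a second's worth of work is counted in core-milliseconds; the function name and parameters are chosen for this illustration.

```python
def throughput_rps(cores: int, normal_ms: int, bad_ms: int, bad_per_second: int) -> int:
    """Requests completed per second when each request holds one core."""
    budget_ms = cores * 1000                          # core-milliseconds per second
    bad_cost_ms = min(bad_per_second * bad_ms, budget_ms)
    normal_done = (budget_ms - bad_cost_ms) // normal_ms
    bad_done = bad_cost_ms // bad_ms
    return normal_done + bad_done


for bad in (0, 4, 8):
    print(bad, "bad requests ->", throughput_rps(8, 100, 1000, bad), "rps")
```

With 4 bad requests, 4 of the 8 cores are tied up for the full second, leaving 4 cores x 10 requests = 40 normal requests plus the 4 bad ones, hence 44 rps.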
    38. Fixing it
        • Aggressive timeouts for all service calls
          – Isolate impact of a slow service
            • Only to pages that depend on it
        • Very aggressive timeouts for non-critical services
          – Example: Recommendations
            • On a Product page, Search results page etc.
            • Not on My Recommendations page
        • Load non-critical parts of pages through AJAX
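A sketch of per-service timeouts with a fallback for non-critical content. The function names, the timeout values and the simulated 0.5 s recommendations latency are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s: float, fallback):
    """Run fn in the pool; give up and return fallback after timeout_s."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()    # best effort: an already running call keeps running
        return fallback


def slow_recommendations():
    time.sleep(0.5)        # simulated slow, non-critical service
    return ["rec1", "rec2"]


def fast_product_info():
    return {"title": "Example product"}


def render_product_page() -> dict:
    # Critical call: generous timeout. Non-critical call: very aggressive one.
    product = call_with_timeout(fast_product_info, timeout_s=1.0, fallback=None)
    recs = call_with_timeout(slow_recommendations, timeout_s=0.05, fallback=[])
    return {"product": product, "recommendations": recs}


start = time.perf_counter()
page = render_product_page()
elapsed_s = time.perf_counter() - start
print(page, f"{elapsed_s:.2f}s")
```

Note the limitation: here the pool thread stays busy until the slow call returns, so only the caller is protected. In a real client the timeout would normally be set on the socket or HTTP call itself so that the worker is released as well.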
    39. Learning from it
        • Isolate the impact of poorly performing services / systems
        • Isolate the required from the good-to-have
    40. Website is “slow”! (Again)
    41. RCA
        • Why?
          – Load average of web servers has spiked
        • Why?
          – Requests per second has spiked
            • From 1000 rps to 1500 rps
        • Why?
          – Large number of notifications of product information updates
    42. Fixing it
        • Separate cluster for receiving product info update notifications from the cluster that serves users
        • Admission control: Don’t let a system receive more requests than it can handle
          – Throttling
        • Batch the notifications
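Throttling and batching can be sketched with a token bucket and a simple chunking helper. The talk does not say which throttling algorithm was used, so the token bucket is a swapped-in, commonly used technique; the 1000 rps limit echoes the figures on the previous slide, but treating it as the system's capacity is an assumption.

```python
class TokenBucket:
    """Admits about rate_per_s requests per second, with bursts up to capacity."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def try_admit(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False       # over the limit: the caller should reject or defer


def batch(items: list, size: int) -> list:
    """Split items into consecutive chunks of at most size elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


bucket = TokenBucket(rate_per_s=1000, capacity=1000)

# 1500 notifications arrive at the same instant (t = 0.0 on a simulated clock).
admitted = sum(bucket.try_admit(now=0.0) for _ in range(1500))
rejected = 1500 - admitted

updates = [f"product-{i}" for i in range(10)]
batches = batch(updates, size=4)
print(admitted, rejected, [len(b) for b in batches])
```

Passing the clock in as `now` keeps the example deterministic; a real deployment would read a monotonic clock instead.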
    43. Learning from it
        • Isolate the systems serving internal requests from those serving production traffic
        • Admission control to ensure that a system is isolated from the over-enthusiasm of a client
        • Look at the granularity at which we’re working
    44. TEENAGER: Increasing complexity
    45. THANK YOU
    46. Mistake?
        • Sub-optimal decision
          – Not all information/scenarios considered
          – Insufficient information
          – Built for a different scenario
        • Due to focus on “functional” aspects
        • A mistake is a mistake
          – … in retrospect