Techtalktrack2 sid-final-130207111143-phpapp02

  • “This has basically given us lots of opportunities to make mistakes. And make mistakes we did.”
  • Website Architecture diagram goes here

    1. 1. Flipkart Website Architecture: Mistakes & Learnings • Siddhartha Reddy, Architect, Flipkart
    2. 2. June 2007
    3. 3. November 2007
    4. 4. December 2012
    5. 5. www.flipkart.com • Started in 2007 • Current Architecture from mid 2010 • Evolution of the architecture presented as… Issue[1] → RCA[2] → Actions → Learnings • [1] Issue: Website is “slow” • [2] RCA = Root Cause Analysis
    6. 6. INFANCY (2007 – MID-2010) Surviving & reacting to the environment
    7. 7. Website is “slow”!
    8. 8. RCA • Why? – MySQL queries taking too long • Why? – Too many queries – Many slow queries – Queries locking tables • Why? – Capacity • Hmm…
    9. 9. Fixing it • Get beefier servers (the obvious) • Separate master_db, slave_db – Writes go to master_db – Reads from slave_db – Critical reads from master_db • [Diagram. Before: one MySQL server takes both reads and writes. After: writes go to the MySQL master, reads go to the MySQL slave, and the master replicates to the slave.]
    10. 10. Learning from it • Scale out database reads by distributing load across systems • Isolate database writes from reads – Writes are (usually) more critical
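The read/write split on the two slides above can be sketched as a small routing function. This is only an illustration: the `Query` type and its `critical` flag are hypothetical names, and the comment about replication lag is an assumption about why critical reads stay on the master (the slides do not give the reason).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    sql: str
    critical: bool = False  # hypothetical flag: this read must not be stale


def choose_database(query: Query) -> str:
    """Return the name of the database that should run this query."""
    is_read = query.sql.lstrip().upper().startswith("SELECT")
    if is_read and not query.critical:
        # Ordinary reads go to the replica (assumed to tolerate replication lag).
        return "slave_db"
    # All writes, plus reads flagged as critical, go to the master.
    return "master_db"


print(choose_database(Query("SELECT title FROM products WHERE id = 1")))
print(choose_database(Query("UPDATE orders SET status = 'paid' WHERE id = 7")))
print(choose_database(Query("SELECT status FROM orders WHERE id = 7", critical=True)))
```

Keeping the decision in one function means the application code never hard-codes a server, so adding a second slave later only changes the router.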
    11. 11. Website is “slow”! (Again)
    12. 12. RCA • Why? – MySQL queries taking too long (on slave_db) • Why? – Too many queries – Many slow queries • Why? – Queries from analytics / reporting and other backend jobs • Urm…
    13. 13. Fixing it • Analytics / reporting DB (archival_db) – Use MyISAM, optimized for reads – Additional indexes for quicker reporting • [Diagram. Before: the MySQL master takes website writes and replicates to one MySQL slave, which serves both website reads and analytics reads. After: the master replicates to two slaves; slave 1 serves website reads, slave 2 serves analytics reads.]
    14. 14. Learning from it • Isolate the databases being used for serving website traffic from those being used for analytics / reporting • Isolate systems being used by production website from those being used for background processing
    15. 15. BABY (2010 – 2011) Learning the basics
    16. 16. Website is “slow”!
    17. 17. RCA • Why? • How? – Instrumentation
    18. 18. RCA - 1 • Why? – Logging a lot – PHP processes blocking on writing logs • [Diagram: Request1, Request2 and Request3 are handled by Process1, Process2 and Process3, which all write to one log file; while Process3 is writing, Process1 and Process2 wait.]
    19. 19. RCA - 2 • Why? – Service Oriented Architecture (SOA) – Too many calls to remote services per request • Creating fresh connection for each call • All the calls are made in serial order • [Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]
    20. 20. RCA - 3 • Why? – Configurability – Fetch a lot of “config” from database for serving each request
    21. 21. RCA – 1,2,3 • Why? – Logging a lot – SOA – Configurability • Why? – PHP’s process model • Argh!
    22. 22. Fixing it • fk-w3-agent – Simple Java “middleware” daemon – Deployed on each web server – PHP communicates to it through local socket – Hosts pluggable “handlers”
    23. 23. fk-w3-agent: LoggingHandler • [Diagram. Before: Process1, Process2 and Process3 (serving Request1, Request2 and Request3) write directly to the log file. After: the three processes send their log messages to fk-w3-agent, which writes them to the log file async / buffered.]
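The async / buffered idea behind the LoggingHandler can be sketched with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. This is not the fk-w3-agent implementation (that is a Java daemon reached over a local socket); it is an in-process analogue, and the logger name `w3` and the file name are made up for the example.

```python
import logging
import logging.handlers
import os
import queue
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "website.log")  # hypothetical log file
log_queue = queue.Queue()  # in-memory buffer between request code and disk

# Request-handling code logs through a handler that only enqueues the record.
logger = logging.getLogger("w3")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# One background thread drains the queue and does the slow file write.
file_handler = logging.FileHandler(log_path)
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

for request_id in (1, 2, 3):
    logger.info("handled request %d", request_id)  # returns without touching disk

listener.stop()        # drains everything still queued, then joins the thread
file_handler.close()

with open(log_path) as f:
    lines = f.read().splitlines()
print(lines)
```

The request path pays only for an enqueue; contention on the log file is confined to the single writer thread.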
    24. 24. fk-w3-agent: ServiceHandler(s) • [Diagram. Before: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response. After: Receive request → Call fk-w3-agent → Send response, with fk-w3-agent talking to Service1 and Service2.]
    25. 25. fk-w3-agent: ConfigHandler • [Diagram. Before: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 (each from the database) → Send response. After: Receive request → Fetch all config from fk-w3-agent → Send response, with fk-w3-agent polling the database and caching the result.]
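A minimal sketch of the "poll and cache" pattern from the ConfigHandler slide, assuming the whole config fits in memory and that slightly stale values are acceptable. The class name `ConfigCache`, the `load_all` callback and the dictionary standing in for the database are all hypothetical.

```python
import threading
import time
from typing import Callable, Dict


class ConfigCache:
    """Holds a full in-memory copy of the config, refreshed by a background poller."""

    def __init__(self, load_all: Callable[[], Dict[str, str]], poll_seconds: float):
        self._load_all = load_all
        self._poll_seconds = poll_seconds
        self._snapshot = load_all()  # initial load, before serving any request
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def _poll(self) -> None:
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._poll_seconds):
            self._snapshot = self._load_all()  # swap in a fresh copy

    def get_all(self) -> Dict[str, str]:
        return self._snapshot  # no database round trip on the request path

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


database = {"free_shipping_threshold": "500"}  # stand-in for the config table
config = ConfigCache(lambda: dict(database), poll_seconds=0.01)
print(config.get_all())
```

A request now reads one in-memory dictionary instead of issuing several config queries; the cost is that a config change becomes visible only after the next poll.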
    26. 26. Learning from it • PHP — good for frontend and templating – Gives a lot of agility – Limiting process model • Hurdle for high performance • Java — stability and performance • Horses for courses
    27. 27. Website is “slow”! (Again)
    28. 28. RCA • Why? – PHP processes taking up too much time – PHP processes taking up too much CPU • Why? – Product info deserialization taking up time/CPU – View construction taking up time/CPU
    29. 29. Fixing it • Caching! • Cache fully constructed pages – For a few minutes – Only for highly trafficked pages (Homepage) • Cache PHP serialized Product objects – ~20 million objects – Memcache • Yeah! But… – Add caching => add complexity
    30. 30. Caching: Complications (1) • “Caching fully constructed pages” • But parts of pages still need to be dynamic • Example: Logged-in user’s name • Impossible to do effective bucket testing • Or at least makes it prohibitively complex
    31. 31. Caching: Complications (2) • “Caching PHP serialized Product objects” • Without caching: getProductInfo() → Fetch from CMS • With caching, cache hit: getProductInfo() → Fetch from Cache • With caching, cache miss: getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
    32. 32. Caching: Complications (3) • TTL: ∞ (i.e. no invalidation) • Pro-actively repopulate products in the cache – Receive “notifications” about product updates • Notification Server — pushes notifications raised by CMS • Use a persistent, distributed cache – Memcache => Membase, Couchbase
    33. 33. Learning from it • Caching is a powerful tool for performance optimization • Caching adds complexities – Reduced by keeping cache close to data source – Think deeply about TTL, invalidation • Use caching to go from “acceptable performance” to “awesome performance” – Don’t rely on it to get to “acceptable performance”
    34. 34. KID (2012) Growing up
    35. 35. Website is “slow”!
    36. 36. RCA • Why? – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow) • But why is the rest of the website slow? – Requests to the slow service are blocking processing threads • Eh?!
    37. 37. Let’s do some math • Let’s say – Mean (or median) response time: 100 ms – 8-core server – All requests are CPU bound • Throughput: 80 requests per second (rps) • Let’s also say – 95th Percentile response time: 1000 ms • Call them “bad requests” • 4 bad requests in a second – Throughput down to 44 rps • 8 bad requests in a second? – Throughput down to 8 rps
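The arithmetic on this slide can be reproduced with a small model: each core offers 1000 ms of work per second, bad requests consume their share first, and whatever core time is left serves normal requests. The function name and parameters are illustrative, and the model assumes fully CPU-bound requests, as the slide does.

```python
def throughput_rps(cores: int, normal_ms: int, bad_ms: int, bad_per_second: int) -> int:
    """Requests completed per second when some requests are much slower than the rest."""
    core_ms_available = cores * 1000                 # CPU time on offer each second
    core_ms_for_bad = bad_per_second * bad_ms        # CPU time eaten by bad requests
    if core_ms_for_bad > core_ms_available:
        raise ValueError("bad requests alone exceed the server's capacity")
    normal_served = (core_ms_available - core_ms_for_bad) // normal_ms
    return normal_served + bad_per_second


for bad in (0, 4, 8):
    print(bad, throughput_rps(cores=8, normal_ms=100, bad_ms=1000, bad_per_second=bad))
# prints: 0 80, then 4 44, then 8 8
```

With 4 bad requests, four cores are busy for the whole second, so only the other four serve normal traffic (40 rps), giving 44 rps in total.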
    38. 38. Fixing it • Aggressive timeouts for all service calls – Isolate impact of a slow service • only to pages that depend on it • Very aggressive timeouts for non-critical services – Example: Recommendations • On a Product page, Search results page etc. • Not on My Recommendations page • Load non-critical parts of pages through AJAX
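One way to sketch an aggressive timeout with a fallback for a non-critical service is shown below, using `concurrent.futures` from the standard library. The service functions and the timeout values are invented for the example. Note one limitation of this sketch: the worker thread keeps running after the timeout, so a real client would also set a timeout on the underlying connection.

```python
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(service, timeout_seconds: float, fallback):
    """Run service(); give up and return fallback if it takes too long."""
    future = pool.submit(service)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback  # the page renders without this section


def slow_recommendations():   # hypothetical non-critical service having a bad day
    time.sleep(0.5)
    return ["p7", "p9"]


def fast_search():            # hypothetical critical service responding normally
    return ["p1", "p2"]


recommendations = call_with_timeout(slow_recommendations, 0.05, fallback=[])
search_results = call_with_timeout(fast_search, 0.5, fallback=[])
print(recommendations, search_results)
```

The product page waits at most 50 ms for recommendations and shows an empty block otherwise, so a slow recommendations service no longer slows pages that only treat it as good-to-have.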
    39. 39. Learning from it • Isolate the impact of poorly performing services / systems • Isolate the required from the good-to-have
    40. 40. Website is “slow”! (Again)
    41. 41. RCA • Why? – Load average of web servers has spiked • Why? – Requests per second has spiked • From 1000 rps to 1500 rps • Why? – Large number of notifications of product information updates
    42. 42. Fixing it • Separate cluster for receiving product info update notifications from the cluster that serves users • Admission control: Don’t let a system receive more requests than it can handle – Throttling • Batch the notifications
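Admission control and batching can each be sketched in a few lines. Here admission is modelled as a cap on in-flight requests, which is one possible throttling policy (the slides do not say which one was used), and `AdmissionController` and `batch` are hypothetical names.

```python
import threading
from typing import List, Sequence


class AdmissionController:
    """Admits a request only while fewer than max_in_flight are being processed."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self) -> bool:
        # Non-blocking: a full system rejects immediately instead of queueing.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()  # call once the admitted request has finished


def batch(items: Sequence[str], size: int) -> List[List[str]]:
    """Group individual notifications so that each request carries up to `size`."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


controller = AdmissionController(max_in_flight=2)
print([controller.try_admit() for _ in range(3)])   # the third caller is turned away
print(batch(["p1", "p2", "p3", "p4", "p5"], size=2))
```

Rejecting early keeps the receiving cluster inside its capacity, and batching reduces the number of requests the notification sender generates in the first place.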
    43. 43. Learning from it • Isolate the systems serving internal requests from those serving production traffic • Admission control to ensure that a system is isolated from the over-enthusiasm of a client • Look at the granularity at which we’re working
    44. 44. TEENAGER Increasing complexity
    45. 45. THANK YOU
    46. 46. Mistake? • Sub-optimal decision – Not all information/scenarios considered – Insufficient information – Built for a different scenario • Due to focus on “functional” aspects • A mistake is a mistake – … in retrospect
