Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy
  • “This has basically given us lots of opportunities to make mistakes. And make mistakes we did.”

Presentation Transcript

  • Flipkart Website Architecture: Mistakes & Learnings • Siddhartha Reddy, Architect, Flipkart
  • June 2007
  • November 2007
  • December 2012
  • www.flipkart.com • Started in 2007 • Current architecture from mid-2010 • Evolution of the architecture presented as: Issue[1], RCA[2], Actions, Learnings • [1] Issue: Website is “slow” • [2] RCA = Root Cause Analysis
  • INFANCY (2007 – mid-2010): Surviving & reacting to the environment
  • Website is “slow”!
  • RCA• Why? – MySQL queries taking too long• Why? – Too many queries – Many slow queries – Queries locking tables• Why? – Capacity• Hmm…
  • Fixing it • Get beefier servers (the obvious) • Separate master_db, slave_db – Writes go to master_db – Reads from slave_db – Critical reads from master_db • [Diagram: before, a single MySQL server handles both writes and reads; after, writes go to the MySQL master, which replicates to a MySQL slave that serves reads]
  • Learning from it • Scale out database reads by distributing load across systems • Isolate database writes from reads – Writes are (usually) more critical
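The read/write split described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `QueryRouter` class, the `critical` flag and the verb list are hypothetical names, not Flipkart's code, and the connection objects are plain strings standing in for real database handles.

```python
import random


class QueryRouter:
    """Picks a database for each statement (hypothetical helper, not from the talk)."""

    # Statements that modify data must always go to the master.
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def pick(self, sql, critical=False):
        # The first word of the statement decides whether it is a write.
        verb = sql.split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS or critical or not self.slaves:
            # Writes and critical reads go to the master, which is never
            # behind on replication.
            return self.master
        # Ordinary reads are spread across the replicas.
        return random.choice(self.slaves)


router = QueryRouter("master_db", ["slave_db"])
```

A "critical" read here means one that cannot tolerate replication lag, for example reading back an order immediately after placing it.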
  • Website is “slow”! (Again)
  • RCA• Why? – MySQL queries taking too long (on slave_db)• Why? – Too many queries – Many slow queries• Why? – Queries from analytics / reporting and other backend jobs• Urm…
  • Fixing it • Analytics / reporting DB (archival_db) – Use MyISAM, which is optimized for reads – Additional indexes for quicker reporting • [Diagram: before, the website writes to the MySQL master and reads from a single replicated slave; after, the website reads from slave 1 while analytics reads go to a second replicated slave, slave 2]
  • Learning from it • Isolate the databases serving website traffic from those used for analytics / reporting • Isolate the systems used by the production website from those used for background processing
  • BABY (2010 – 2011): Learning the basics
  • Website is “slow”!
  • RCA• Why?• How? – Instrumentation
  • RCA - 1 • Why? – Logging a lot – PHP processes blocking on writing logs • [Diagram: Request1, Request2 and Request3 are handled by Process1, Process2 and Process3; one process is writing to the shared log file while the other two wait]
  • RCA - 2 • Why? – Service Oriented Architecture (SOA) – Too many calls to remote services per request • Creating a fresh connection for each call • All the calls are made in serial order • [Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]
  • RCA - 3 • Why? – Configurability – Fetch a lot of “config” from database for serving each request • [Diagram: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response]
  • RCA – 1,2,3• Why? – Logging a lot – SOA – Configurability• Why? – PHP’s process model• Argh!
  • Fixing it • fk-w3-agent – Simple Java “middleware” daemon – Deployed on each web server – PHP communicates with it through a local socket – Hosts pluggable “handlers”
  • fk-w3-agent: LoggingHandler • [Diagram: before, Process1, Process2 and Process3 write to the log file directly; after, they hand log lines to fk-w3-agent, which writes the log file asynchronously with buffering]
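The idea behind the LoggingHandler (request handlers hand off log lines and never block on the file) can be illustrated in-process with Python's standard `QueueHandler` and `QueueListener`. This is an analogue only: the real fk-w3-agent is a separate Java daemon reached over a local socket, and `build_async_logger` is a hypothetical helper name.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, target_handler):
    """Return a logger whose callers only enqueue records.

    A background thread owned by the QueueListener drains the queue and
    passes each record to target_handler, so slow I/O never blocks callers.
    """
    log_queue = queue.Queue(-1)  # unbounded buffer between callers and writer
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False  # keep records out of the root logger
    listener.start()
    return logger, listener


# Usage: the request path only enqueues; stop() flushes what is still queued.
logger, listener = build_async_logger("web", logging.StreamHandler())
logger.info("served request %s", "r1")
listener.stop()
```

The trade-off is the usual one for buffered logging: records still in the queue are lost if the process dies before they are written.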
  • fk-w3-agent: ServiceHandler(s) • [Diagram: before, Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response; after, Receive request → Call fk-w3-agent → Send response, with fk-w3-agent talking to Service1 and Service2]
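One way to picture what a long-lived agent makes possible is issuing the remote calls concurrently from a reusable worker pool instead of serially with a fresh connection each time. The sketch below assumes the calls are independent of each other; `call_services_in_parallel` and `fake_service` are hypothetical names, and the sleeps stand in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_services_in_parallel(calls, pool):
    """Run every zero-argument callable in `calls` on the pool.

    `calls` maps a service name to a callable; the result maps the same
    names to return values. Total latency is roughly that of the slowest
    call rather than the sum of all calls.
    """
    futures = {name: pool.submit(fn) for name, fn in calls.items()}
    return {name: future.result() for name, future in futures.items()}


def fake_service(name, delay_s):
    """Stand-in for a remote call (hypothetical, for illustration only)."""
    def call():
        time.sleep(delay_s)
        return f"{name}-ok"
    return call


# A long-lived pool, much as a daemon would keep connections open and reuse them.
pool = ThreadPoolExecutor(max_workers=4)
results = call_services_in_parallel(
    {"service1": fake_service("service1", 0.05),
     "service2": fake_service("service2", 0.05)},
    pool,
)
```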
  • fk-w3-agent: ConfigHandler • [Diagram: before, Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 from the database → Send response; after, Receive request → Fetch all config from fk-w3-agent → Send response, while fk-w3-agent polls the database and caches the config]
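The "poll and cache" behaviour of the ConfigHandler might look roughly like the following. The `ConfigCache` class, the `load_all` callback and the 30-second default interval are assumptions made for this sketch; the talk does not say how often the agent polls.

```python
import threading


class ConfigCache:
    """Serves config from memory and refreshes it in the background."""

    def __init__(self, load_all, poll_interval_s=30.0):
        self._load_all = load_all            # callable returning the full config mapping
        self._poll_interval_s = poll_interval_s
        self._snapshot = dict(load_all())    # initial load at startup
        self._stop = threading.Event()

    def get(self, key, default=None):
        # Request path: a dictionary lookup, never a database query.
        return self._snapshot.get(key, default)

    def refresh(self):
        # Build a new snapshot and swap the reference in one assignment,
        # so readers see either the old config or the new one.
        self._snapshot = dict(self._load_all())

    def start_polling(self):
        def loop():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._poll_interval_s):
                self.refresh()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

With this shape, a config change becomes visible after at most one polling interval, which is the price paid for removing the per-request database fetches.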
  • Learning from it• PHP — good for frontend and templating – Gives a lot of agility – Limiting process model • Hurdle for high performance• Java — stability and performance• Horses for courses
  • Website is “slow”! (Again)
  • RCA• Why? – PHP processes taking up too much time – PHP processes taking up too much CPU• Why? – Product info deserialization taking up time/CPU – View construction taking up time/CPU
  • Fixing it• Caching!• Cache fully constructed pages – For a few minutes – Only for highly trafficked pages (Homepage)• Cache PHP serialized Product objects – ~20 million objects – Memcache• Yeah! But… – Add caching => add complexity
  • Caching: Complications (1)• “Caching fully constructed pages”• But parts of pages still need to be dynamic • Example: Logged-in user’s name• Impossible to do effective bucket testing • Or at least makes it prohibitively complex
  • Caching: Complications (2) • “Caching PHP serialized Product objects” • Without caching: getProductInfo() → Fetch from CMS • With caching, cache hit: getProductInfo() → Fetch from Cache • With caching, cache miss: getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
  • Caching: Complications (3)• TTL: ∞ (i.e. no invalidation)• Pro-actively repopulate products in the cache – Receive “notifications” about product updates • Notification Server — pushes notifications raised by CMS• Use a persistent, distributed cache – Memcache => Membase, Couchbase
  • Learning from it• Caching is a powerful tool for performance optimization• Caching adds complexities – Reduced by keeping cache close to data source – Think deeply about TTL, invalidation• Use caching to go from “acceptable performance” to “awesome performance” – Don’t rely on it to get to “acceptable performance”
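The read path from "Complications (2)" and the notification-driven repopulation from "Complications (3)" can be sketched together. A plain dictionary stands in for Memcache / Couchbase, and `ProductCache`, `fetch_from_cms` and `on_product_updated` are hypothetical names chosen for this illustration.

```python
class ProductCache:
    """Read-through product cache with no TTL, refreshed by update notifications."""

    def __init__(self, cache, fetch_from_cms):
        self.cache = cache                    # dict standing in for a distributed cache
        self.fetch_from_cms = fetch_from_cms  # callable: product_id -> product

    def get_product_info(self, product_id):
        product = self.cache.get(product_id)
        if product is None:
            # Cache miss: fetch from the CMS, then set in the cache.
            product = self.fetch_from_cms(product_id)
            self.cache[product_id] = product
        return product

    def on_product_updated(self, product_id):
        # Called when the notification server reports a CMS change.
        # The entry is overwritten rather than deleted, so readers
        # never have to pay for a miss on an updated product.
        self.cache[product_id] = self.fetch_from_cms(product_id)
```

Because entries never expire in this scheme, a missed notification leaves stale data in place indefinitely, which is one reason the slides stress thinking deeply about TTL and invalidation.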
  • KID (2012): Growing up
  • Website is “slow”!
  • RCA • Why? – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow) • But why is the rest of the website slow? – Requests to the slow service are blocking processing threads • Eh?!
  • Let’s do some math• Let’s say – Mean (or median) response time: 100 ms – 8-core server – All requests are CPU bound• Throughput: 80 requests per second (rps)• Let’s also say – 95th Percentile response time: 1000 ms • Call them “bad requests”• 4 bad requests in a second – Throughput down to 44 rps• 8 bad requests in a second? – Throughput down to 8 rps
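The arithmetic on this slide can be reproduced with a small model. It assumes, as the slide does, that every request is CPU bound and occupies one core for its whole duration, so a 1000 ms request ties up a core for a full second. The function name and parameters are illustrative.

```python
def throughput_rps(cores, normal_ms, bad_ms, bad_per_second):
    """Requests completed per second when some requests are slow.

    Each bad request consumes bad_ms of one core; whatever core time is
    left over in the second is spent on normal requests.
    """
    # The server cannot complete more bad requests than its cores allow.
    max_bad = cores * 1000.0 / bad_ms
    bad_done = min(bad_per_second, max_bad)
    busy_core_seconds = bad_done * bad_ms / 1000.0
    free_core_seconds = cores - busy_core_seconds
    normal_done = free_core_seconds * 1000.0 / normal_ms
    return bad_done + normal_done


baseline = throughput_rps(8, 100, 1000, 0)     # 80.0
with_four_bad = throughput_rps(8, 100, 1000, 4)  # 44.0
with_eight_bad = throughput_rps(8, 100, 1000, 8)  # 8.0
```

With 4 bad requests, 4 cores are fully occupied and the remaining 4 serve 10 requests each, giving 4 + 40 = 44. With 8 bad requests every core is occupied and only those 8 requests complete.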
  • Fixing it• Aggressive timeouts for all service calls – Isolate impact of a slow service • only to pages that depend on it• Very aggressive timeouts for non-critical services – Example: Recommendations • On a Product page, Search results page etc. • Not on My Recommendations page• Load non-critical parts of pages through AJAX
  • Learning from it • Isolate the impact of poorly performing services / systems • Isolate the required from the good-to-have
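An aggressive timeout with a fallback for a non-critical service could be sketched as follows. `call_with_timeout` and the 50 ms budget are assumptions for illustration; the talk does not give actual timeout values.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=8)


def call_with_timeout(fn, timeout_s, fallback):
    """Return fn() if it finishes within timeout_s, otherwise `fallback`."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting here. The worker thread itself is not
        # interrupted, so a real client should also set connect and read
        # timeouts on the underlying socket.
        future.cancel()
        return fallback


def slow_recommendations():
    time.sleep(0.5)  # simulates a degraded service
    return ["item-1", "item-2"]


# The page renders without recommendations instead of waiting for them.
recommendations = call_with_timeout(slow_recommendations, 0.05, fallback=[])
```

For a page whose whole purpose is the service in question (the slide's example is the My Recommendations page), a longer timeout or an explicit error is more appropriate than an empty fallback.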
  • Website is “slow”! (Again)
  • RCA• Why? – Load average of web servers has spiked• Why? – Requests per second has spiked • From 1000 rps to 1500 rps• Why? – Large number of notifications of product information updates
  • Fixing it• Separate cluster for receiving product info update notifications from the cluster that serves users• Admission control: Don’t let a system receive more requests than it can handle – Throttling• Batch the notifications
  • Learning from it• Isolate the systems serving internal requests from those serving production traffic• Admission control to ensure that a system is isolated from the over-enthusiasm of a client• Look at the granularity at which we’re working
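Admission control and batching can both be shown in a few lines. The token bucket below is one common throttling technique, used here as an assumption since the talk does not name the mechanism; `TokenBucket`, `try_admit` and `batch` are hypothetical names.

```python
import time


class TokenBucket:
    """Admits at most `rate_per_s` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_admit(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject or defer the request


def batch(items, size):
    """Group notifications so that one request carries up to `size` of them."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Throttling protects the receiver regardless of how the client behaves, while batching reduces the request count at the source; the two are complementary.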
  • TEENAGER: Increasing complexity
  • THANK YOU
  • Mistake?• Sub-optimal decision – Not all information/scenarios considered – Insufficient information – Built for a different scenario• Due to focus on “functional” aspects• A mistake is a mistake – … in retrospect