Flipkart Website Architecture

      Mistakes & Learnings

          Siddhartha Reddy
          Architect, Flipkart
June 2007
November 2007
December 2012
www.flipkart.com
• Started in 2007
• Current Architecture from mid 2010
• Evolution of the architecture presented as…

       Issue[1]             RCA[2]   Actions   Learnings




•   [1] Issue: Website is “slow”
•   [2] RCA = Root Cause Analysis
Surviving & reacting to the environment

INFANCY (2007 – MID-2010)
Website is “slow”!
RCA
• Why?
  – MySQL queries taking too long
• Why?
  – Too many queries
  – Many slow queries
  – Queries locking tables
• Why?
  – Capacity
• Hmm…
Fixing it
• Get beefier servers (the obvious)
• Separate master_db, slave_db
  – Writes go to master_db
  – Reads from slave_db
  – Critical reads from master_db
[Diagram. Before: reads and writes both go to a single MySQL server. After: writes go to the MySQL master, which replicates to a MySQL slave that serves reads.]
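The routing rule on this slide can be sketched in a few lines. This is only an illustration: `Database`, `QueryRouter` and the `critical` flag are hypothetical names, and the comment about replication lag is an assumption about why critical reads stay on the master.

```python
from dataclasses import dataclass


@dataclass
class Database:
    """Stand-in for a real connection; only the name matters here."""
    name: str


class QueryRouter:
    """Pick master_db or slave_db for a query (hypothetical helper)."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master: Database, slave: Database) -> None:
        self.master = master
        self.slave = slave

    def route(self, sql: str, critical: bool = False) -> Database:
        verb = sql.lstrip().split(None, 1)[0].upper()
        # Writes always go to the master.
        if verb in self.WRITE_VERBS:
            return self.master
        # Critical reads also go to the master (assumed reason: the slave
        # may lag behind the master while replication catches up).
        if critical:
            return self.master
        # Every other read is served by the slave.
        return self.slave


router = QueryRouter(Database("master_db"), Database("slave_db"))
print(router.route("SELECT * FROM products WHERE id = 1").name)
print(router.route("UPDATE orders SET status = 'paid' WHERE id = 7").name)
print(router.route("SELECT status FROM orders WHERE id = 7", critical=True).name)
```

Keeping the decision in one place means callers only have to say whether a read is critical; they never pick a server themselves.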
Learning from it
• Scale out database reads by distributing load
  across systems
• Isolate database writes from reads
  – Writes are (usually) more critical
Website is “slow”!
    (Again)
RCA
• Why?
  – MySQL queries taking too long (on slave_db)
• Why?
  – Too many queries
  – Many slow queries
• Why?
  – Queries from analytics / reporting and other
    backend jobs
• Urm…
Fixing it
• Analytics / reporting DB (archival_db)
    – Use MyISAM — optimized for reads
    – Additional indexes for quicker reporting
[Diagram. Before: website writes go to the MySQL master, which replicates to a single MySQL slave serving both website reads and analytics reads. After: the master replicates to two slaves; Slave 1 serves website reads and Slave 2 serves analytics reads.]
Learning from it
• Isolate the databases being used for serving
  website traffic from those being used for
  analytics/reporting
• Isolate systems being used by production
  website from those being used for background
  processing
Learning the basics

BABY (2010 – 2011)
Website is “slow”!
RCA
• Why?
• How?
  – Instrumentation
RCA - 1
• Why?
     – Logging a lot
     – PHP processes blocking on writing logs
[Diagram: Request1 -> Process1, Request2 -> Process2 and Request3 -> Process3 all append to a single log file; while one process is writing, the others wait.]
RCA - 2
• Why?
  – Service Oriented Architecture (SOA)
  – Too many calls to remote services per request
     • Creating fresh connection for each call
     • All the calls are made in serial order


   Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response
RCA - 3
• Why?
  – Configurability
  – Fetch a lot of “config” from database for serving
    each request
     Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response
RCA – 1,2,3
• Why?
  – Logging a lot
  – SOA
  – Configurability
• Why?
  – PHP’s process model
• Argh!
Fixing it
• fk-w3-agent
  – Simple Java “middleware” daemon
  – Deployed on each web server
  – PHP communicates to it through local socket
  – Hosts pluggable “handlers”
fk-w3-agent: LoggingHandler

[Diagram. Before: Process1, Process2 and Process3 each write directly to the log file. After: the processes hand their log messages to fk-w3-agent, which writes them to the log file asynchronously through a buffer.]
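The async / buffered idea can be tried out in a single process. The sketch below is not fk-w3-agent, which the slides describe as a separate Java daemon reached over a local socket; it is an in-process analogue using Python's standard `logging.handlers.QueueHandler` and `QueueListener`, with a hypothetical temporary log path.

```python
import logging
import logging.handlers
import queue
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "web.log"  # hypothetical log location

# Request handlers only enqueue records; enqueueing does not touch the disk.
log_queue: queue.Queue = queue.Queue()
logger = logging.getLogger("web")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# A single background thread drains the queue and does the slow file writes.
file_handler = logging.FileHandler(log_path)
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

for request_id in (1, 2, 3):
    logger.info("handled request %d", request_id)

listener.stop()        # processes everything still queued, then joins the thread
file_handler.close()
lines = log_path.read_text().splitlines()
print(lines)
```

The request path now pays only for a queue insert, and the single writer removes the contention on the log file.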
fk-w3-agent: ServiceHandler(s)
[Diagram. Before: Receive request -> Connect to Service1 -> Request Service1 -> Connect to Service2 -> Request Service2 -> Send response. After: Receive request -> Call fk-w3-agent -> Send response, with fk-w3-agent talking to Service1 and Service2.]
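A rough sketch of what such a handler might do, assuming it fans the calls out in parallel and reuses long-lived workers in place of fresh connections (the slides name both problems in the RCA but do not spell out the handler's internals). `call_service1`, `call_service2` and `handle_request` are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_service1(product_id: int) -> dict:
    time.sleep(0.05)  # pretend network latency
    return {"service": "Service1", "product_id": product_id}


def call_service2(product_id: int) -> dict:
    time.sleep(0.05)  # pretend network latency
    return {"service": "Service2", "product_id": product_id}


# A long-lived pool plays the role of the agent's already-open connections.
pool = ThreadPoolExecutor(max_workers=4)


def handle_request(product_id: int) -> list:
    # Submit both calls before waiting on either, so they overlap in time.
    futures = [
        pool.submit(call_service1, product_id),
        pool.submit(call_service2, product_id),
    ]
    # Results come back in submission order.
    return [future.result() for future in futures]


responses = handle_request(42)
print([response["service"] for response in responses])
pool.shutdown()
```

With serial calls the request would wait for the sum of the two latencies; with overlapping calls it waits for roughly the slower of the two.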
fk-w3-agent: ConfigHandler
[Diagram. Before: Receive request -> Fetch Config1 -> Fetch Config2 -> Fetch Config3 -> Fetch Config4 -> Send response, with each fetch hitting the database. After: Receive request -> Fetch all config from fk-w3-agent -> Send response; fk-w3-agent polls the database and caches the config.]
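"Poll and cache" could look roughly like this. `ConfigCache`, `load_all_from_db` and the in-memory `database` dict are hypothetical; a real daemon would call `refresh()` on a timer and read from MySQL.

```python
import threading
from typing import Callable, Dict


class ConfigCache:
    """Serve config from memory; talk to the database only on refresh."""

    def __init__(self, load_all: Callable[[], Dict[str, str]]) -> None:
        self._load_all = load_all
        self._lock = threading.Lock()
        self._snapshot: Dict[str, str] = {}
        self.refresh()

    def refresh(self) -> None:
        fresh = dict(self._load_all())  # one round trip for all config
        with self._lock:
            self._snapshot = fresh

    def get_all(self) -> Dict[str, str]:
        with self._lock:
            return dict(self._snapshot)


database = {"Config1": "a", "Config2": "b"}  # stand-in for the config tables
db_reads = {"count": 0}


def load_all_from_db() -> Dict[str, str]:
    db_reads["count"] += 1
    return database


cache = ConfigCache(load_all_from_db)
for _ in range(100):          # 100 requests, none of them touch the database
    config = cache.get_all()

database["Config3"] = "c"     # config changes in the database
cache.refresh()               # the next poll picks it up
print(db_reads["count"], sorted(cache.get_all()))
```

The trade-off is staleness: a config change becomes visible only after the next poll.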
Learning from it
• PHP — good for frontend and templating
  – Gives a lot of agility
  – Limiting process model
     • Hurdle for high performance
• Java — stability and performance
• Horses for courses
Website is “slow”!
    (Again)
RCA
• Why?
  – PHP processes taking up too much time
  – PHP processes taking up too much CPU
• Why?
  – Product info deserialization taking up time/CPU
  – View construction taking up time/CPU
Fixing it
• Caching!
• Cache fully constructed pages
  – For a few minutes
  – Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
  – ~20 million objects
  – Memcache
• Yeah! But…
  – Add caching => add complexity
Caching: Complications (1)
• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
     • Example: Logged-in user’s name
• Impossible to do effective bucket testing
     • Or at least makes it prohibitively complex
Caching: Complications (2)
• “Caching PHP serialized Product objects”
• Without caching:
    getProductInfo() -> Fetch from CMS
• With caching, cache hit:
    getProductInfo() -> Fetch from Cache
• With caching, cache miss:
    getProductInfo() -> Fetch from Cache -> Fetch from CMS -> Set in Cache
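The three paths above fit in one small function. In this sketch plain dicts stand in for Memcache and the CMS, and `fetch_from_cms` and the sample product are made up for illustration.

```python
from typing import Dict, List

cms: Dict[str, dict] = {"book-1": {"title": "Example Book", "price": 250}}
cache: Dict[str, dict] = {}
cms_fetches: List[str] = []  # records every trip to the CMS


def fetch_from_cms(product_id: str) -> dict:
    cms_fetches.append(product_id)
    return dict(cms[product_id])


def get_product_info(product_id: str) -> dict:
    product = cache.get(product_id)           # Fetch from Cache
    if product is None:                       # cache miss
        product = fetch_from_cms(product_id)  # Fetch from CMS
        cache[product_id] = product           # Set in Cache
    return product


first = get_product_info("book-1")   # miss: goes to the CMS
second = get_product_info("book-1")  # hit: served from the cache
print(len(cms_fetches))
```

The extra branch is the added complexity the slide refers to: every caller now depends on the cache being correct as well as the CMS.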
Caching: Complications (3)
• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
  – Receive “notifications” about product updates
     • Notification Server — pushes notifications raised by
       CMS
• Use a persistent, distributed cache
  – Memcache => Membase, Couchbase
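With no TTL, freshness depends entirely on the notification handler. A minimal sketch, assuming each notification carries a product id; `on_product_updated` and the dict-based `cms` and `cache` are hypothetical.

```python
cms = {"book-1": {"title": "Example Book", "price": 250}}
cache = {}


def on_product_updated(product_id: str) -> None:
    """Handle one product-update notification by repopulating the cache."""
    # Overwrite rather than delete, so readers never see a miss for this key.
    cache[product_id] = dict(cms[product_id])


on_product_updated("book-1")              # initial population
cms["book-1"]["price"] = 199              # product changes in the CMS
stale_price = cache["book-1"]["price"]    # cache is stale until notified
on_product_updated("book-1")              # notification arrives
fresh_price = cache["book-1"]["price"]
print(stale_price, fresh_price)
```

A lost notification leaves the entry stale indefinitely, which is one reason to think deeply about invalidation.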
Learning from it
• Caching is a powerful tool for performance
  optimization
• Caching adds complexities
  – Reduced by keeping cache close to data source
  – Think deeply about TTL, invalidation
• Use caching to go from “acceptable
  performance” to “awesome performance”
  – Don’t rely on it to get to “acceptable
    performance”
Growing up

KID (2012)
Website is “slow”!
RCA
• Why?
  – Search-service is slow (or Reviews-service is slow
    or Recommendations-service is slow)
• But why is the rest of the website slow?
  – Requests to the slow service are blocking
    processing threads
• Eh?!
Let’s do some math
• Let’s say
   – Mean (or median) response time: 100 ms
   – 8-core server
   – All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
   – 95th Percentile response time: 1000 ms
       • Call them “bad requests”
• 4 bad requests in a second
   – Throughput down to 44 rps
• 8 bad requests in a second?
   – Throughput down to 8 rps
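The arithmetic can be reproduced with a short function. It assumes, as the slide does, that every request is CPU bound and that each bad request occupies one core for its full duration; `throughput_rps` is a name chosen for this sketch.

```python
def throughput_rps(cores: int, normal_ms: int, bad_ms: int, bad_requests: int) -> float:
    """Requests completed in one second on CPU-bound cores."""
    core_ms_total = cores * 1000
    # Each bad request keeps one core busy for bad_ms milliseconds.
    core_ms_left = core_ms_total - bad_requests * bad_ms
    normal_requests = core_ms_left / normal_ms
    return bad_requests + normal_requests


for bad in (0, 4, 8):
    print(bad, throughput_rps(cores=8, normal_ms=100, bad_ms=1000, bad_requests=bad))
```

This prints 80.0, 44.0 and 8.0 rps for 0, 4 and 8 bad requests: eight slow requests per second are enough to absorb the whole server.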
Fixing it
• Aggressive timeouts for all service calls
  – Isolate impact of a slow service
     • only to pages that depend on it
• Very aggressive timeouts for non-critical
  services
  – Example: Recommendations
     • On a Product page, Search results page etc.
     • Not on My Recommendations page
• Load non-critical parts of pages through AJAX
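An aggressive timeout with a fallback might be sketched as follows. `slow_recommendations`, `call_with_timeout` and the 50 ms budget are illustrative assumptions, not values from the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=4)


def slow_recommendations(product_id: int) -> list:
    time.sleep(0.3)  # simulate a misbehaving service
    return ["rec-1", "rec-2"]


def call_with_timeout(fn, arg, timeout_s: float, fallback):
    future = pool.submit(fn, arg)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Stop waiting and render the page without this section.
        # Note: the worker thread still runs to completion in the background;
        # a real client would also set a timeout on the socket itself.
        return fallback


recommendations = call_with_timeout(slow_recommendations, 42, timeout_s=0.05, fallback=[])
print(recommendations)
```

For a non-critical section such as recommendations on a product page, an empty fallback is acceptable; on the My Recommendations page the same call is critical and would need a more generous budget.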
Learning from it
• Isolate the impact of poorly performing
  services / systems
• Isolate the required from the good-to-have
Website is “slow”!
    (Again)
RCA
• Why?
  – Load average of web servers has spiked
• Why?
  – Requests per second has spiked
     • From 1000 rps to 1500 rps
• Why?
  – Large number of notifications of product
    information updates
Fixing it
• Separate cluster for receiving product info
  update notifications from the cluster that
  serves users
• Admission control: Don’t let a system receive
  more requests than it can handle
  – Throttling
• Batch the notifications
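Admission control can be as simple as a cap on in-flight requests. The sketch below is one possible form of throttling, not necessarily the one Flipkart used; `AdmissionController` and the limit of 2 are hypothetical.

```python
import threading


class AdmissionController:
    """Reject work beyond a fixed number of in-flight requests."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self) -> bool:
        # Non-blocking: a full system says "no" immediately instead of queueing.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


controller = AdmissionController(max_in_flight=2)
decisions = [controller.try_admit() for _ in range(3)]  # third caller is throttled
controller.release()                                     # one request finishes
decisions.append(controller.try_admit())                 # capacity is available again
print(decisions)
```

Rejecting early keeps the load on the system bounded and pushes the retry decision back to the over-enthusiastic client.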
Learning from it
• Isolate the systems serving internal requests
  from those serving production traffic
• Admission control to ensure that a system is
  isolated from the over-enthusiasm of a client
• Look at the granularity at which we’re working
Increasing complexity

TEENAGER
THANK YOU
Mistake?
• Sub-optimal decision
  – Not all information/scenarios considered
  – Insufficient information
  – Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
  – … in retrospect

 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 

Slash n: Tech Talk Track 2 – Website Architecture-Mistakes & Learnings - Siddhartha Reddy

  • 1. Flipkart Website Architecture Mistakes & Learnings Siddhartha Reddy Architect, Flipkart
  • 5. www.flipkart.com • Started in 2007 • Current Architecture from mid 2010 • Evolution of the architecture presented as… Issue[1] RCA[2] Actions Learnings • [1] Issue: Website is “slow” • [2] RCA = Root Cause Analysis
  • 6. Surviving & reacting to the environment INFANCY (2007 – MID-2010)
  • 8. RCA • Why? – MySQL queries taking too long • Why? – Too many queries – Many slow queries – Queries locking tables • Why? – Capacity • Hmm…
  • 9. Fixing it • Get beefier servers (the obvious) • Separate master_db, slave_db – Writes go to master_db – Reads from slave_db – Critical reads from master_db [Diagram: before, one MySQL server handles both reads and writes; after, writes go to the MySQL master, which replicates to a MySQL slave that serves reads]
  • 10. Learning from it • Scale out database reads by distributing load across systems • Isolate database writes from reads – Writes are (usually) more critical
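
The read/write split described on slide 9 can be sketched as a small router. This is a minimal illustration, not Flipkart's code: the `QueryRouter` class, the `critical` flag, and the SELECT-prefix check are assumptions made for the example; only the names master_db and slave_db come from the slides.

```python
import itertools


class QueryRouter:
    """Pick a database for each statement (hypothetical helper).

    Writes and critical reads go to the master; all other reads are
    spread round-robin across the slaves.
    """

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql, critical=False):
        # Simplifying assumption: anything that is not a SELECT is a write.
        is_read = sql.lstrip().lower().startswith("select")
        if not is_read or critical:
            return self.master
        return next(self._slaves)


router = QueryRouter("master_db", ["slave_db"])
```

A critical read is one that cannot tolerate replication lag, for example reading back an order that was just written, so the sketch sends it to the master.
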
  • 12. RCA • Why? – MySQL queries taking too long (on slave_db) • Why? – Too many queries – Many slow queries • Why? – Queries from analytics / reporting and other backend jobs • Urm…
  • 13. Fixing it • Analytics / reporting DB (archival_db) – Use MyISAM (optimized for reads) – Additional indexes for quicker reporting [Diagram: before, website writes go to the MySQL master and website reads to a single replicated slave; after, the master replicates to Slave 1 for website reads and to Slave 2 for analytics reads]
  • 14. Learning from it • Isolate the databases being used for serving website traffic from those being used for analytics/reporting • Isolate systems being used by production website from those being used for background processing
  • 15. Learning the basics BABY (2010 – 2011)
  • 17. RCA • Why? • How? – Instrumentation
  • 18. RCA - 1 • Why? – Logging a lot – PHP processes blocking on writing logs [Diagram: three requests, each handled by its own PHP process; one process is writing to the shared log file while the other two wait]
  • 19. RCA - 2 • Why? – Service Oriented Architecture (SOA) – Too many calls to remote services per request • Creating fresh connection for each call • All the calls are made in serial order [Diagram: Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response]
  • 20. RCA - 3 • Why? – Configurability – Fetch a lot of “config” from database for serving each request [Diagram: Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response]
  • 21. RCA – 1,2,3 • Why? – Logging a lot – SOA – Configurability • Why? – PHP’s process model • Argh!
  • 22. Fixing it • fk-w3-agent – Simple Java “middleware” daemon – Deployed on each web server – PHP communicates to it through local socket – Hosts pluggable “handlers”
  • 23. fk-w3-agent: LoggingHandler [Diagram: before, each PHP process writes directly to the log file; after, the processes hand log entries to fk-w3-agent, which writes the log file asynchronously with buffering]
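
The idea behind the LoggingHandler, taking the log write off the request path, can be sketched in-process with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. This is only an analogue under stated assumptions: the slides describe a separate Java daemon reached over a local socket, and the `make_async_logger` helper name is hypothetical.

```python
import logging
import logging.handlers
import queue


def make_async_logger(name, target_handler):
    """Return (logger, listener); the caller must stop the listener on exit.

    Request code only enqueues records (cheap, non-blocking); a background
    thread owned by the listener does the slow write via target_handler.
    """
    log_queue = queue.Queue()  # unbounded in-memory buffer
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener
```

In use, `target_handler` would typically be a `logging.FileHandler`; calling `listener.stop()` at shutdown flushes whatever is still queued.
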
  • 24. fk-w3-agent: ServiceHandler(s) [Diagram: before, Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response; after, Receive request → Call fk-w3-agent → Send response, with fk-w3-agent talking to Service1 and Service2]
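
The RCA names two problems with service calls: a fresh connection per call and strictly serial execution. A rough sketch of the second fix, fanning calls out in parallel from one long-lived worker pool, is shown below. The slides do not spell out how the ServiceHandlers work internally, so `call_all`, the pool size, and the callable-per-service shape are all assumptions; connection reuse is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# One long-lived pool, created once per agent process (size is an assumption).
_POOL = ThreadPoolExecutor(max_workers=8)


def call_all(calls):
    """Run every service call concurrently and collect the results.

    calls: dict mapping a service name to a zero-argument callable.
    Returns a dict mapping each name to that callable's return value.
    """
    futures = {name: _POOL.submit(fn) for name, fn in calls.items()}
    # Total wait is roughly the slowest call, not the sum of all calls.
    return {name: future.result() for name, future in futures.items()}
```

With serial calls, two 100 ms services cost about 200 ms per request; with this fan-out they cost about 100 ms.
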
  • 25. fk-w3-agent: ConfigHandler [Diagram: before, Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 (each from the database) → Send response; after, Receive request → Fetch all config from fk-w3-agent → Send response, with fk-w3-agent polling the database and caching the result]
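
The "poll and cache" behaviour of the ConfigHandler can be sketched as a background thread that refreshes an in-memory snapshot, so that requests never touch the database. The `ConfigCache` class, its `loader` callback, and the 30-second default interval are assumptions for illustration; the slides give no such details.

```python
import threading


class ConfigCache:
    """Keep a periodically refreshed copy of all config in memory."""

    def __init__(self, loader, poll_interval_s=30.0):
        self._loader = loader              # hypothetical: returns a dict of all config
        self._interval = poll_interval_s
        self._lock = threading.Lock()
        self._snapshot = dict(loader())    # initial load before serving
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def _poll(self):
        # Event.wait returns False on timeout, True once close() is called.
        while not self._stop.wait(self._interval):
            fresh = dict(self._loader())
            with self._lock:
                self._snapshot = fresh

    def get_all(self):
        """One cheap in-memory read replaces several database fetches."""
        with self._lock:
            return dict(self._snapshot)

    def close(self):
        self._stop.set()
        self._thread.join()
```

The trade-off is staleness: a config change becomes visible only after the next poll, so the interval bounds how out of date a request can be.
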
  • 26. Learning from it • PHP — good for frontend and templating – Gives a lot of agility – Limiting process model • Hurdle for high performance • Java — stability and performance • Horses for courses
  • 28. RCA • Why? – PHP processes taking up too much time – PHP processes taking up too much CPU • Why? – Product info deserialization taking up time/CPU – View construction taking up time/CPU
  • 29. Fixing it • Caching! • Cache fully constructed pages – For a few minutes – Only for highly trafficked pages (Homepage) • Cache PHP serialized Product objects – ~20 million objects – Memcache • Yeah! But… – Add caching => add complexity
  • 30. Caching: Complications (1) • “Caching fully constructed pages” • But parts of pages still need to be dynamic • Example: Logged-in user’s name • Impossible to do effective bucket testing • Or at least makes it prohibitively complex
  • 31. Caching: Complications (2) • “Caching PHP serialized Product objects” • Without caching: getProductInfo() → Fetch from CMS • With caching, cache hit: getProductInfo() → Fetch from Cache • With caching, cache miss: getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
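
The three flows on slide 31 amount to the usual cache-aside pattern. A minimal sketch follows, with a plain dict standing in for Memcache and a `fetch_from_cms` callback standing in for the CMS client; both stand-ins and the Python spelling `get_product_info` are assumptions.

```python
def get_product_info(product_id, cache, fetch_from_cms):
    """Return product info, consulting the cache before the CMS."""
    info = cache.get(product_id)
    if info is not None:
        return info                    # cache hit: no CMS call
    info = fetch_from_cms(product_id)  # cache miss: go to the source
    cache[product_id] = info           # set in cache for the next caller
    return info
```

The extra branch is the added complexity the slides warn about: every caller now has to reason about what happens when the cached copy and the CMS disagree.
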
  • 32. Caching: Complications (3) • TTL: ∞ (i.e. no invalidation) • Pro-actively repopulate products in the cache – Receive “notifications” about product updates • Notification Server — pushes notifications raised by CMS • Use a persistent, distributed cache – Memcache => Membase, Couchbase
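
With no TTL, freshness depends entirely on update notifications overwriting cached entries. The sketch below shows that flow under assumptions: `ProductCache`, `on_notification`, and the dict-backed store are hypothetical stand-ins for the Notification Server consumer and the Membase/Couchbase cluster.

```python
class ProductCache:
    """Cache whose entries never expire and are refreshed by notifications."""

    def __init__(self, fetch_from_cms):
        self._fetch = fetch_from_cms
        self._store = {}  # stand-in for a persistent, distributed cache

    def on_notification(self, product_id):
        # Proactively repopulate: overwrite the entry rather than delete it,
        # so readers never see a miss for a product that was just updated.
        self._store[product_id] = self._fetch(product_id)

    def get(self, product_id):
        # Readers only consult the cache; there is no TTL to expire entries.
        return self._store.get(product_id)
```

This is why the slides call for a persistent cache: if an in-memory cache is lost, nothing repopulates it until the next notification for each product.
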
  • 33. Learning from it • Caching is a powerful tool for performance optimization • Caching adds complexities – Reduced by keeping cache close to data source – Think deeply about TTL, invalidation • Use caching to go from “acceptable performance” to “awesome performance” – Don’t rely on it to get to “acceptable performance”
  • 36. RCA • Why? – Search-service is slow (or Reviews-service is slow or Recommendations-service is slow) • But why is rest of website slow? – Requests to the slow service are blocking processing threads • Eh?!
  • 37. Let’s do some math • Let’s say – Mean (or median) response time: 100 ms – 8-core server – All requests are CPU bound • Throughput: 80 requests per second (rps) • Let’s also say – 95th Percentile response time: 1000 ms • Call them “bad requests” • 4 bad requests in a second – Throughput down to 44 rps • 8 bad requests in a second? – Throughput down to 8 rps
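
The arithmetic on slide 37 can be reproduced with a short function. The model is an assumption inferred from the slide's numbers: each request holds one core for its full duration, so a 1000 ms "bad request" removes a whole core from service for that second. The function name and parameters are made up for the example.

```python
def throughput_rps(cores, normal_ms, bad_ms, bad_per_second):
    """Requests completed per second when some requests are slow."""
    # Core-seconds consumed each second by the slow requests.
    busy_cores = bad_per_second * bad_ms / 1000.0
    free_cores = max(cores - busy_cores, 0.0)
    # Each free core completes 1000 / normal_ms normal requests per second.
    normal_rps = free_cores * (1000.0 / normal_ms)
    # Slow requests still complete, but no faster than the cores allow.
    bad_rps = min(bad_per_second, cores * 1000.0 / bad_ms)
    return normal_rps + bad_rps


baseline = throughput_rps(8, 100, 1000, 0)    # 80.0
four_bad = throughput_rps(8, 100, 1000, 4)    # 44.0
eight_bad = throughput_rps(8, 100, 1000, 8)   # 8.0
```

Four slow requests per second are about 5% of an 80 rps load, yet they cut throughput by almost half, which is the point of the slide.
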
  • 38. Fixing it • Aggressive timeouts for all service calls – Isolate impact of a slow service • only to pages that depend on it • Very aggressive timeouts for non-critical services – Example: Recommendations • On a Product page, Search results page etc. • Not on My Recommendations page • Load non-critical parts of pages through AJAX
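
An aggressive timeout with a fallback for a non-critical service might look like the sketch below. It is an illustration only: `call_with_timeout`, the pool size, and the fallback value are assumptions, and a production client would normally set the timeout on the socket or HTTP call itself, because the worker thread here keeps running after the caller gives up.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=16)


def call_with_timeout(fn, timeout_s, fallback=None):
    """Run fn, but stop waiting after timeout_s and return fallback."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already running call is not interrupted
        return fallback
```

For recommendations on a product page, the fallback could simply be an empty list, so the page renders without the widget instead of stalling.
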
  • 39. Learning from it • Isolate the impact of poorly performing services / systems • Isolate the required from the good-to-have
  • 41. RCA • Why? – Load average of web servers has spiked • Why? – Requests per second has spiked • From 1000 rps to 1500 rps • Why? – Large number of notifications of product information updates
  • 42. Fixing it • Separate cluster for receiving product info update notifications from the cluster that serves users • Admission control: Don’t let a system receive more requests than it can handle – Throttling • Batch the notifications
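
Admission control and batching can both be sketched in a few lines. The `AdmissionController` class (a cap on in-flight requests) and the `batch` helper are assumptions chosen to illustrate the two bullets; the slides do not say which throttling scheme was used.

```python
import threading


class AdmissionController:
    """Reject work once max_in_flight requests are already being handled."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, handler, *args):
        """Return (accepted, result); rejected work returns (False, None)."""
        if not self._slots.acquire(blocking=False):
            return False, None  # shed load instead of queueing it
        try:
            return True, handler(*args)
        finally:
            self._slots.release()


def batch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Batching addresses the granularity point: sending 100 product updates as one request costs the receiver far less per-request overhead than 100 separate notifications.
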
  • 43. Learning from it • Isolate the systems serving internal requests from those serving production traffic • Admission control to ensure that a system is isolated from the over-enthusiasm of a client • Look at the granularity at which we’re working
  • 47. Mistake? • Sub-optimal decision – Not all information/scenarios considered – Insufficient information – Built for a different scenario • Due to focus on “functional” aspects • A mistake is a mistake – … in retrospect

Editor's Notes

  1. “This has basically given us lots of opportunities to make mistakes. And make mistakes we did.”
  2. Website Architecture diagram goes here
  3. No