Building Modern Web Sites: A Story of Scalability and Availability



Video and slides synchronized, mp3 and slide download available at URL

Sid Anand uses examples from LinkedIn, Netflix, and eBay to discuss some common causes of outages and scaling issues. He also discusses modern practices in availability and scaling in web sites today. Filmed at

Siddharth "Sid" Anand has deep experience designing and scaling high-traffic web sites. Currently, he is a senior member of LinkedIn's Data Infrastructure team focusing on analytics infrastructure. Prior to joining LinkedIn, he served as Netflix's Cloud Database Architect, Etsy's VP of Engineering, a search engineer and researcher at eBay, and a performance engineer at Siebel Systems.


Building Modern Web Sites: A Story of Scalability and Availability

  1. Building a Modern Website for Scale. Sid Anand, QCon NY 2013.
  2. News & Community Site
  • 750,000 unique visitors/month
  • Published in 4 languages (English, Chinese, Japanese, and Brazilian Portuguese)
  • Post content from our QCon conferences: News 15-20/week, Articles 3-4/week, Presentations (videos) 12-15/week, Interviews 2-3/week, Books 1/month
  3. Presented at QCon New York
  • Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation
  • Strategy: a practitioner-driven conference designed for YOU, the influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators
  • Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide
  4. About Me (@r39132)
  Current Life…
  • LinkedIn: Search, Network, and Analytics (SNA); Search Infrastructure
  In a Previous Life…
  • LinkedIn, Data Infrastructure, Architect
  • Netflix, Cloud Database Architect
  • eBay, Web Development, Research Lab, & Search Engine
  And Many Years Prior…
  • Studying Distributed Systems at Cornell University
  5. Our mission: connect the world's professionals to make them more productive and successful
  6. Over 200M members and counting
  • LinkedIn members (millions): 2 (2004), 4 (2005), 8 (2006), 17 (2007), 32 (2008), 55 (2009), 90 (2010), 145 (2011), 200+ (2012)
  • The world's largest professional network, growing at more than 2 members/sec
  7. The world's largest professional network
  • >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire
  • >2.9M Company Pages
  • >5.7B professional searches in 2012
  • 19 languages
  • >30M students and NCGs, the fastest growing demographic
  • Over 64% of members are now international
  8. Other Company Facts
  • Headquartered in Mountain View, Calif., with offices around the world
  • As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around the world
  9. Agenda
  • Company Overview (done)
  • Serving Architecture
  • How Does LinkedIn Scale: Web Services, Databases, Messaging, Other
  • Q & A
  10. Serving Architecture
  11. LinkedIn: Serving Architecture — Overview
  • Our site runs primarily on Java, with some use of Scala for specific infrastructure
  • The presentation tier is an exception: it runs on everything!
  • What runs on Scala? The Network Graph Engine, Kafka, and some front ends (Play)
  • Most of our services run on Jetty
  12. LinkedIn: Serving Architecture
  • The presentation tier hosts many front-end frameworks: Frontier, Play, Spring MVC, NodeJS, JRuby, Grails, Django, and USSR (Chrome V8 JS engine)
  • It is composed of ATS with 2 plugins:
  • Fizzy: a content aggregator that unifies content across a diverse set of front-ends; an open-source JS templating framework
  • USSR (a.k.a. Unified Server-Side Rendering): packages Google Chrome's V8 JS engine as an ATS plugin
  13. LinkedIn: Serving Architecture — Tiers
  • Presentation Tier: a thin layer focused on building the UI; a web page requests information A and B, and the tier assembles the page by making parallel requests to BST services
  • Business Service Tier (BST): encapsulates business logic; can call other BST clusters and its own DST cluster
  • Data Service Tier (DST): encapsulates DAL logic and is concerned with one Oracle schema; fronted by Memcached
  • Data Infrastructure: concerned with the persistent storage of and easy access to data (Oracle master/slaves, Hadoop, other)
  14. Serving Architecture: Other?
  As I will discuss later, data that is committed to databases also needs to be made available to a host of other online serving systems:
  • Search
  • Standardization services: these provide canonical names for your titles, companies, schools, skills, fields of study, etc.
  • Graph engine
  • Recommender systems
  This data change feed (from Oracle or Espresso to the search index, graph index, read replicas, and standardization) needs to be scalable, reliable, and fast. [Databus]
  15. Serving Architecture: Hadoop
  How do we use Hadoop to serve?
  • Hadoop is central to our analytic infrastructure
  • We ship data streams into Hadoop from our primary databases via Databus & from applications via Kafka
  • Hadoop jobs take daily or hourly dumps of this data and compute data files that Voldemort can load
  • Voldemort loads these files and serves them on the site
  16. Voldemort: RO Store Usage at LinkedIn
  • People You May Know
  • LinkedIn Skills
  • Related Searches
  • Viewers of this profile also viewed
  • Events you may be interested in
  • Jobs you may be interested in
  17. How Does LinkedIn Scale?
  18. Scaling Web Services
  19. LinkedIn: Scaling Web Services
  Problem
  • How do 150+ web services communicate with each other to fulfill user requests in the most efficient and fault-tolerant manner?
  • How do they handle slow downstream dependencies?
  • For illustration's sake, consider the following scenario: Service B has 2 hosts, Service C has 2 hosts, and a machine in Service B sends a web request to a machine in Service C
  20. LinkedIn: Scaling Web Services
  What sorts of failure modes are we concerned about?
  • A machine in Service C has a long GC pause; or calls a service that has a long GC pause; or calls a service that calls a service that has a long GC pause; … see where I am going?
  • A machine in Service C or in its downstream dependencies may be slow for any reason, not just GC (e.g. bottlenecks on CPU, IO, or memory, or lock contention)
  Goal: given all of this, how can we ensure high uptime?
  Hint: pick the right architecture and implement best practices on top of it!
  21. LinkedIn: Scaling Web Services
  In the early days, LinkedIn made a big bet on Spring and Spring RPC.
  Issues
  1. Spring RPC is difficult to debug
  • You cannot call the service using simple command-line tools like curl
  • Since the RPC call is implemented as a binary payload over HTTP, HTTP access logs are not very useful
  2. A Spring RPC-based architecture leads to high MTTR
  • Spring RPC is not flexible and pluggable: we cannot use custom client-side load-balancing strategies or custom fault-tolerance features
  • Instead, all we can do is put all of our service nodes behind a hardware load balancer & pray!
  • If a Service C node experiences a slowness issue, a NOC engineer needs to be alerted and then manually remove it from the LB (MTTR > 30 minutes)
  22. LinkedIn: Scaling Web Services
  Solution
  A better solution is one that we see often in both cloud-based architectures and NoSQL systems: Dynamic Discovery + client-side load balancing
  • Step 1: Service C nodes announce their availability to serve traffic to a ZooKeeper (ZK) registry
  • Step 2: Service B nodes get updates from ZK
  • Step 3: Service B nodes route traffic to Service C nodes
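The three steps above can be sketched in a few lines of Python. This is a minimal model, not LinkedIn's implementation: a plain in-memory dict stands in for the ZooKeeper ensemble (real deployments use ephemeral znodes so a crashed node disappears automatically), and all class and method names are illustrative.

```python
import random

class ServiceRegistry:
    """Stand-in for ZooKeeper: maps service name -> set of live hosts."""
    def __init__(self):
        self._services = {}

    def announce(self, service, host):
        # Step 1: a node announces its availability to serve traffic.
        self._services.setdefault(service, set()).add(host)

    def withdraw(self, service, host):
        # With real ZK ephemeral nodes this happens automatically on crash.
        self._services.get(service, set()).discard(host)

    def lookup(self, service):
        # Step 2: clients read the current membership.
        return sorted(self._services.get(service, set()))

class Client:
    """Client-side load balancer: picks a live host per request (Step 3)."""
    def __init__(self, registry, service):
        self.registry, self.service = registry, service

    def pick_host(self):
        hosts = self.registry.lookup(self.service)
        if not hosts:
            raise RuntimeError("no live hosts for " + self.service)
        return random.choice(hosts)

registry = ServiceRegistry()
registry.announce("service-C", "c1:8080")
registry.announce("service-C", "c2:8080")
b = Client(registry, "service-C")
assert b.pick_host() in ("c1:8080", "c2:8080")
# A slow node withdraws (or its ephemeral znode expires): no NOC engineer,
# no manual LB change, traffic simply stops flowing to it.
registry.withdraw("service-C", "c1:8080")
assert b.pick_host() == "c2:8080"
```

The key contrast with the hardware-LB setup on the previous slide is that membership changes propagate to clients in seconds rather than requiring a >30 minute manual MTTR.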
  23. LinkedIn: Scaling Web Services
  With this new paradigm for discovering services and routing requests to them, we can incorporate additional fault-tolerance features.
  24. LinkedIn: Scaling Web Services
  Best Practices: Fault-tolerance Support
  1. No client should wait indefinitely for a response from a service
  • Issue: waiting causes a traffic jam, since all upstream clients end up also getting blocked. Each service has a fixed number of Jetty or Tomcat threads; once those are all tied up waiting, no new requests can be handled
  • Solution: after a configurable timeout, return. Store different SLAs in ZK for each REST endpoint; in other words, all calls are not the same and should not have the same read timeout
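The per-endpoint timeout idea can be sketched as follows. The endpoint table and the `call_with_sla` helper are hypothetical (the slide stores SLAs in ZK and the services are Java; this is just a compact model of "every call gets its own read timeout and a fallback"):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical per-endpoint SLAs; the talk keeps these in ZooKeeper so each
# REST endpoint can have its own read timeout instead of one global value.
ENDPOINT_TIMEOUT_SECS = {"/profile": 0.05, "/search": 0.2}

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_sla(endpoint, fn, default):
    """Run fn, but give up after the endpoint's SLA and return a fallback,
    so container threads are never tied up waiting indefinitely."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=ENDPOINT_TIMEOUT_SECS[endpoint])
    except TimeoutError:
        future.cancel()  # best-effort; a running task still completes
        return default

assert call_with_sla("/profile", lambda: "data", default=None) == "data"
slow = lambda: (time.sleep(0.5), "late")[1]   # simulates a stalled backend
assert call_with_sla("/profile", slow, default="fallback") == "fallback"
```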
  25. LinkedIn: Scaling Web Services
  Best Practices: Fault-tolerance Support
  2. Isolate calls to back-ends from one another
  • Issue: you depend on responses from independent services A and B. If A slows down, will you still be able to serve B?
  • Details: this is a common use case for federated services and for shard-aggregators. E.g. search at LinkedIn is federated and will call people-search, job-search, group-search, etc. in parallel. People-search is itself sharded, so an additional shard-aggregation step needs to happen across 100s of shards
  • Solution: use async requests, or independent ExecutorServices for sync requests (one per shard or vertical)
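A minimal sketch of the "one executor per vertical" idea, using Python's thread pools in place of Java ExecutorServices (the vertical names come from the slide; the backend function and timeouts are made up for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream vertical, so a slow backend exhausts only its own
# threads and cannot starve the other verticals.
pools = {v: ThreadPoolExecutor(max_workers=2)
         for v in ("people-search", "job-search", "group-search")}

def federated_search(query, per_call_timeout=0.1):
    # Fan out in parallel, then gather with a per-call timeout.
    futures = {v: pools[v].submit(backend, v, query) for v in pools}
    results = {}
    for vertical, fut in futures.items():
        try:
            results[vertical] = fut.result(timeout=per_call_timeout)
        except Exception:
            results[vertical] = []   # degrade: serve whatever we do have
    return results

def backend(vertical, query):
    if vertical == "job-search":
        time.sleep(1.0)              # simulate one stalled vertical
    return ["%s hit for %r" % (vertical, query)]

r = federated_search("flink")
assert r["people-search"] and r["group-search"]  # healthy verticals answered
assert r["job-search"] == []                     # slow vertical degraded
```

The design point is the isolation: if all verticals shared one pool, the stalled job-search tasks would eventually occupy every thread and take people-search down with them.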
  26. LinkedIn: Scaling Web Services
  Best Practices: Fault-tolerance Support
  3. Cancel unnecessary work
  • Issue: work issued down the call graph is unnecessary if the clients at the top of the call graph have already timed out. Imagine that as a call reaches half-way down your call tree, the caller at the root times out. You will still issue work down the remaining half-depth of your tree unless you cancel it!
  • Solution (a possible approach): the root of the call tree adds (<tree-UUID>, inProgress status) to Memcached. All services pass the tree-UUID down the call tree (e.g. as a custom HTTP request header). Servlet filters at each hop check whether inProgress == false; if so, immediately respond with an empty response
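A sketch of that cancellation protocol, with a plain dict standing in for Memcached and the helper names (`start_call_tree`, `handle_request`) invented for illustration:

```python
import uuid

# Dict standing in for Memcached; keys are call-tree UUIDs.
in_progress = {}

def start_call_tree():
    """Root of the call tree registers (tree-UUID, inProgress=True)."""
    tree_id = str(uuid.uuid4())
    in_progress[tree_id] = True
    return tree_id

def root_timed_out(tree_id):
    """Root flips the flag when its own client has already given up."""
    in_progress[tree_id] = False

def handle_request(tree_id, work):
    """Stand-in for the servlet filter at every hop: the tree id travels
    down the call graph (e.g. in a custom HTTP header), and each service
    checks it before doing any work."""
    if not in_progress.get(tree_id, False):
        return None          # empty response: the caller already timed out
    return work()

tid = start_call_tree()
assert handle_request(tid, lambda: "result") == "result"
root_timed_out(tid)
assert handle_request(tid, lambda: "result") is None   # work skipped
```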
  27. LinkedIn: Scaling Web Services
  Best Practices: Fault-tolerance Support
  4. Avoid sending requests to hosts that are GCing
  • Issue: if a client sends a web request to a host in Service C and that host is experiencing a GC pause, the client will wait 50-200 ms, depending on the read timeout for the request. During that GC pause, other requests will also be sent to that node before they all eventually time out
  • Solution: send a "GC scout" request before every "real" web request
  28. LinkedIn: Scaling Web Services
  Why is this a good idea? Scout requests are cheap and add negligible overhead.
  • Step 1: a Service B node sends a cheap 1 msec TCP request to a dedicated "scout" Netty port
  • Step 2: if the scout request comes back within 1 msec, send the real request to the Tomcat or Jetty port
  • Step 3: else, repeat with a different host in Service C
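The scout check reduces to "probe a second port with a tiny time budget; only send the real request if the probe answers in time." A rough Python sketch (the port numbers, function names, and the use of a bare TCP connect as the probe are all illustrative; the real scout is a Netty endpoint on the JVM whose event loop stalls during a GC pause):

```python
import socket
import time

def scout_ok(host, scout_port, budget_secs=0.001):
    """Cheap probe to the dedicated scout port. If the host is mid-GC even
    this trivial request stalls past the budget, so we fail fast."""
    try:
        start = time.monotonic()
        with socket.create_connection((host, scout_port), timeout=budget_secs):
            pass
        return (time.monotonic() - start) <= budget_secs
    except OSError:
        return False   # refused / unreachable / timed out

def send_request(hosts, scout_port, do_request):
    """Step 2/3: send the real (Tomcat/Jetty) request only to a host whose
    scout probe answered in time; otherwise try the next host."""
    for host in hosts:
        if scout_ok(host, scout_port):
            return do_request(host)
    raise RuntimeError("no responsive host in Service C")
```

In practice the 1 ms budget is the whole point: it is far cheaper to burn a millisecond per request than to park a caller for a 50-200 ms read timeout against a paused JVM.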
  29. LinkedIn: Scaling Web Services
  Best Practices: Fault-tolerance Support
  5. Services should protect themselves from traffic bursts
  • Issue: service nodes should protect themselves from being overwhelmed by requests; this also protects their downstream servers from being overwhelmed. Simply setting the Tomcat or Jetty thread pool size is not always an option; often these are not configurable per application
  • Solution: use a sliding-window counter. If the counter exceeds a configured threshold, return immediately with a 503 ("service unavailable"). Set the threshold below the Tomcat or Jetty thread pool size
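A minimal sliding-window counter of the kind described, sketched in Python (class and parameter names are illustrative; a production version would live in a request filter and answer HTTP 503 instead of returning a bool):

```python
import collections
import time

class SlidingWindowThrottle:
    """Reject requests once more than `limit` arrive within the trailing
    `window_secs`; the limit is set below the container's thread pool size."""
    def __init__(self, limit, window_secs=1.0):
        self.limit, self.window_secs = limit, window_secs
        self.stamps = collections.deque()   # arrival times inside the window

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop arrivals that have slid out of the window.
        while self.stamps and now - self.stamps[0] > self.window_secs:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False                    # caller answers 503 immediately
        self.stamps.append(now)
        return True

throttle = SlidingWindowThrottle(limit=2, window_secs=1.0)
assert throttle.allow(now=0.0)
assert throttle.allow(now=0.1)
assert not throttle.allow(now=0.2)   # burst: this request would get a 503
assert throttle.allow(now=1.5)       # window slid past the burst
```

Rejecting early with a 503 is deliberately cheap: the node sheds load before any thread is tied up, which is what shields both the node and its downstream dependencies.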
  30. Espresso: Scaling Databases
  31. Espresso: Overview
  Problem
  • What do we do when we run out of QPS capacity on an Oracle database server? You can only buy yourself out of this problem so far (i.e. buy a bigger box). Read replicas and Memcached will help scale reads, but not writes!
  Solution → Espresso
  You need a horizontally-scalable database! Espresso is LinkedIn's newest NoSQL store. It offers the following features:
  • Horizontal scalability; works on commodity hardware
  • Document-centric: Avro documents supporting rich nested data models; schema evolution is drama-free
  • Extensions for Lucene indexing
  • Supports transactions (within a partition, e.g. memberId)
  • Supports conditional reads & writes using standard HTTP headers (e.g. if-modified-since)
  32. Espresso: Overview
  Why not use open-source? We needed capabilities that existing open-source stores did not offer together:
  • A change capture stream (e.g. Databus)
  • Backup-restore
  • A mature storage engine (InnoDB)
  33. Espresso: Architecture
  Components:
  • Request Routing Tier: consults the Cluster Manager to discover the node to route to; forwards the request to the appropriate storage node
  • Storage Tier: data store (MySQL); local secondary index (Lucene)
  • Cluster Manager: responsible for data set partitioning; manages storage nodes
  • Relay Tier: replicates data to consumers
  34. Databus: Scaling Database Streams
  35. Databus: Overview
  Problem
  Our databases (Oracle & Espresso) are used for R/W web-site traffic. However, various services (Search, Graph DB, Standardization, etc.) need the ability to:
  • Read the data as it is changed in these OLTP stores
  • Occasionally, scan the contents in order to rebuild their entire state
  Solution → Databus
  Databus provides a consistent, in-time-order stream of database changes that:
  • Scales horizontally
  • Protects the source database from high read load
  36. Where Does LinkedIn Use Databus?
  37. Databus: Usage @ LinkedIn
  A user updates the company, title, & school on his profile. He also accepts a connection. The write is made to an Oracle or Espresso master, and Databus replicates it:
  • The profile change is applied to the Standardization service (e.g. the many forms of IBM are canonicalized for search-friendliness and recommendation-friendliness)
  • The profile change is applied to the Search Index service (recruiters can find you immediately by new keywords)
  • The connection change is applied to the Graph Index service (the user can now start receiving feed updates from his new connections immediately)
  38. Databus: Architecture
  Databus consists of 2 services:
  • Relay Service: sharded; maintains an in-memory event buffer per shard; each shard polls Oracle and then deserializes transactions into Avro
  • Bootstrap Service: picks up online changes as they appear in the Relay; supports 2 types of operations from clients: if a client falls behind and needs records older than what the relay has, Bootstrap can send consolidated deltas (since time T); if a new client comes online or an existing client fell too far behind, Bootstrap can send a consistent snapshot (at time U)
  39. Databus: Architecture
  Guarantees
  • Transactions, with in-commit-order delivery → commits are replicated in order
  • Durability → you can replay the change stream at any time in the future
  • Reliability → 0% data loss
  • Low latency → if your consumers can keep up with the relay, sub-second response time
  40. Databus: Architecture
  Cool Features: server-side (i.e. relay-side & bootstrap-side) filters
  • Problem: say that your consuming service is sharded 100 ways, e.g. member search indexes sharded by member_id % 100 (index_0, index_1, …, index_99). However, you have a single member Databus stream. How do you avoid having every shard read data it is not interested in?
  • Solution: easy, Databus already understands the notion of server-side filters. It will only send updates to your consumer instance for the shard it is interested in
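The server-side filter described here reduces to a mod-partition predicate evaluated on the relay before events are sent. A tiny sketch (the event shape and field names are invented for illustration; real Databus filters are configured by the consumer and applied on the relay/bootstrap side):

```python
NUM_SHARDS = 100   # consumer sharded by member_id % 100, as in the slide

def mod_filter(shard):
    """Predicate the relay applies before sending: index_7 only ever sees
    events whose member_id % 100 == 7, so the other 99 shards' traffic
    never crosses the wire to it."""
    return lambda event: event["member_id"] % NUM_SHARDS == shard

events = [{"member_id": m, "field": "title"} for m in (7, 107, 205, 312)]
for_shard_7 = [e for e in events if mod_filter(7)(e)]
assert [e["member_id"] for e in for_shard_7] == [7, 107]
```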
  41. Kafka: Scaling Messaging
  42. Kafka: Overview
  Problem
  We have Databus to stream changes that were committed to a database. How do we capture and stream high-volume data if we relax the requirement that the data needs long-term durability? In other words, the data can have limited retention.
  Challenges
  • Needs to handle a large volume of events
  • Needs to be highly available, scalable, and low-latency
  • Needs to provide limited durability guarantees (e.g. data retained for a week)
  Solution → Kafka
  Kafka is a messaging system that supports topics. Consumers can subscribe to topics and read all data within the retention window. Consumers are then notified of new messages as they appear!
  43. Kafka: Usage @ LinkedIn
  Kafka is used at LinkedIn for a variety of business-critical needs. Examples:
  • End-user activity tracking (a.k.a. web tracking): emails opened, logins, pages seen, executed searches, social gestures (likes, sharing, comments)
  • Data center operational metrics: network & system metrics such as TCP metrics (connection resets, message resends, etc.) and system metrics (iops, CPU, load average, etc.)
  44. Kafka: Architecture
  • Web tier pushes events to the broker tier; consumers pull events via the Kafka client library; ZooKeeper handles message-id management and topic/partition ownership; brokers use sequential writes and sendfile
  Features: pub/sub; batch send/receive; E2E compression; system decoupling
  Guarantees: at-least-once delivery; very high throughput; low latency (0.8); durability (for a time period); horizontally scalable
  Scale at LinkedIn: 28 billion unique messages written per day; at peak, 460k writes/sec and 2.3M reads/sec across 693 topics
  45. Kafka: Improvements in 0.8
  Low-latency features
  • Kafka has always been designed for high throughput, but E2E latency could have been as high as 30 seconds
  • Feature 1: long polling
  • For high-throughput topics, a consumer's request for data will always be fulfilled. For low-throughput topics, a consumer's request will likely return 0 bytes, causing the consumer to back off and wait. What happens if data arrives on the broker in the meantime?
  • As of 0.8, a consumer can "park" a request on the broker for up to m milliseconds. If data arrives during this period, it is instantly returned to the consumer
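The long-polling behavior can be modeled with a blocking queue: the consumer's fetch parks for up to a maximum wait, and a message arriving mid-park is returned immediately instead of waiting for the next poll cycle. This toy `Broker` is a sketch of the idea only, not Kafka's actual fetch API:

```python
import queue
import threading
import time

class Broker:
    """Toy stand-in for a Kafka 0.8 broker with long-polling fetches."""
    def __init__(self):
        self._q = queue.Queue()

    def produce(self, msg):
        self._q.put(msg)                 # wakes any parked fetch immediately

    def fetch(self, max_wait):
        """Park for up to max_wait seconds instead of returning 0 bytes."""
        try:
            return self._q.get(timeout=max_wait)
        except queue.Empty:
            return None                  # nothing arrived in the window

broker = Broker()
# Data arrives 50 ms after the consumer parks its fetch...
threading.Timer(0.05, broker.produce, args=("page-view",)).start()
start = time.monotonic()
msg = broker.fetch(max_wait=1.0)
# ...and is returned as soon as it lands, well before the 1 s park expires.
assert msg == "page-view" and time.monotonic() - start < 0.9
```

Compared with the pre-0.8 back-off-and-retry loop, the worst-case latency on a quiet topic drops from the consumer's polling interval to essentially the broadcast delay.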
  46. Kafka: Improvements in 0.8
  Low-latency features
  • In the past, data was not visible to a consumer until it was flushed to disk on the broker
  • Feature 2: new commit protocol
  • In 0.8, replicas and a new commit protocol have been introduced. As long as data has been replicated to the memory of all replicas, even if it has not been flushed to disk on any one of them, it is considered "committed" and becomes visible to consumers
  47. Acknowledgments
  • Jay Kreps (Kafka)
  • Neha Narkhede (Kafka)
  • Kishore Gopalakrishna (Helix)
  • Bob Shulman (Espresso)
  • Cuong Tran (Performance & Scalability)
  • Diego "Mono" Buthay (Search Infrastructure)
  48. Questions?