Building a Modern Website for Scale (QCon NY 2013)

My talk at QCon NY 2013 on how LinkedIn scales for high traffic and high availability

Slide notes:
  • For us, fundamentally changing the way the world works begins with our mission statement: To connect the world’s professionals to make them more productive and successful. This means not only helping people to find their dream jobs, but also enabling them to be great at the jobs they’re already in.
  • We’re making great strides toward our mission: LinkedIn has over 225 million members, and we’re now adding more than two members per second. This is the fastest rate of absolute member growth in the company’s history. Sixty-four percent of LinkedIn members are currently located outside of the United States. LinkedIn counts executives from all 2012 Fortune 500 companies as members; its corporate talent solutions are used by 88 of the Fortune 100 companies. More than 2.9 million companies have LinkedIn Company Pages. LinkedIn members did over 5.7 billion professionally-oriented searches on the platform in 2012. [See http://press.linkedin.com/about for a complete list of LinkedIn facts and stats]
  • The functionality above is computed by Hadoop but served to the site via Voldemort. We leverage these profiles and the social graph to extract insights. Some examples are shown here, such as recommendations: people you may know, jobs you may be interested in.
  • Kafka provides best-effort message transport at high throughput with durability. For replication, consistency and ordering of messages with extremely low latency are critical. Databus is about timeline-consistent change capture between two persistent systems. This leads to slightly different design tradeoffs in the two systems.
  • What do we mean when we say that Databus is LinkedIn’s data-change propagation system? On a change in the primary stores (e.g. the profiles DB, the connections DB, etc.), the changes are buffered in a broker (the Databus Relay). This can happen through either push or pull. The relay can also capture the transactional semantics of updates. Clients poll (including long polls) for changes in the relay. A special client is the Bootstrap DB, which allows for long look-back queries into the history of changes. If a client falls behind the stream of change events in the relay, it will be automatically redirected to the Bootstrap DB, which can deliver a compressed delta of the changes since the last event seen by the client. By “compressed” we mean that only the latest change to a row is delivered. An extreme case is when a new machine is added to the client cluster and it needs to *bootstrap* its initial state. In this case, the Bootstrap DB will deliver a consistent snapshot of the data as of some point in time, which can later be used to continue consumption from the relay. Databus provides … The guarantees given by Databus are …
  • Simple, efficient, persistent, high-volume messaging system. Decouples message production from consumption. Publish/subscribe model: producer(s) publish messages to a topic; client(s) subscribe to a topic. Publishers PUSH to the broker (sync & async options); subscribers PULL from the broker.
Transcript:

    1. Building a Modern Website for Scale. Sid Anand, QCon NY 2013. (Recruiting Solutions)
    2. About Me. Current life: LinkedIn Search, Network, and Analytics (SNA), Search Infrastructure. In a previous life: LinkedIn, Data Infrastructure, Architect; Netflix, Cloud Database Architect; eBay, Web Development, Research Lab, & Search Engine. And many years prior: studying distributed systems at Cornell University. @r39132
    3. Our mission: Connect the world's professionals to make them more productive and successful.
    4. Over 200M members and counting. [Chart: LinkedIn members (millions) by year, 2004-2012.] The world's largest professional network, growing at more than 2 members/sec. Source: http://press.linkedin.com/about
    5. The world's largest professional network; over 64% of members are now international. >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire. >2.9M Company Pages. >5.7B professional searches in 2012. 19 languages. >30M students and NCGs, the fastest growing demographic. Source: http://press.linkedin.com/about
    6. Other company facts: Headquartered in Mountain View, Calif., with offices around the world! As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around the world. Source: http://press.linkedin.com/about
    7. Agenda: Company Overview; Serving Architecture; How Does LinkedIn Scale (Web Services, Databases, Messaging, Other); Q & A.
    8. Serving Architecture
    9. LinkedIn : Serving Architecture. Overview: Our site runs primarily on Java, with some use of Scala for specific infrastructure. The presentation tier is an exception: it runs on everything! What runs on Scala? The Network Graph Engine, Kafka, and some front ends (Play). Most of our services run on Jetty.
    10. LinkedIn : Serving Architecture. [Diagram: Frontier presentation tier with Play, Spring MVC, NodeJS, JRuby, Grails, and Django front-ends, plus USSR (Chrome V8 JS engine).] Our presentation tier is composed of ATS with 2 plugins: Fizzy, a content aggregator that unifies content across a diverse set of front-ends, built on an open-source JS templating framework; and USSR (a.k.a. Unified Server-Side Rendering), which packages Google Chrome's V8 JS engine as an ATS plugin.
    11. LinkedIn : Serving Architecture. [Diagram: Presentation Tier → Business Service Tier → Data Service Tier → Data Infrastructure (Oracle master/slave, memcached, Hadoop, other).] A web page requests information A and B. The presentation tier is a thin layer focused on building the UI; it assembles the page by making parallel requests to BST services. The business service tier encapsulates business logic and can call other BST clusters and its own DST cluster. The data service tier encapsulates DAL logic and is concerned with one Oracle schema. The data infrastructure is concerned with the persistent storage of, and easy access to, data.
    12. Serving Architecture : Other? [Diagram: Oracle or Espresso → data change events → search index, graph index, read replicas, standardization.] As I will discuss later, data that is committed to databases also needs to be made available to a host of other online serving systems: Search; Standardization services (these provide canonical names for your titles, companies, schools, skills, fields of study, etc.); the Graph engine; and Recommender systems. This data change feed needs to be scalable, reliable, and fast. [Databus]
    13. Serving Architecture : Hadoop. How do we use Hadoop to serve? Hadoop is central to our analytic infrastructure. We ship data streams into Hadoop from our primary databases via Databus and from applications via Kafka. Hadoop jobs take daily or hourly dumps of this data and compute data files that Voldemort can load! Voldemort loads these files and serves them on the site.
    14. Voldemort : RO Store Usage at LinkedIn. People You May Know, LinkedIn Skills, Related Searches, Viewers of this profile also viewed, Events you may be interested in, Jobs you may be interested in.
    15. How Does LinkedIn Scale?
    16. Scaling Web Services (LinkedIn : Web Services)
    17. LinkedIn : Scaling Web Services. Problem: How do 150+ web services communicate with each other to fulfill user requests in the most efficient and fault-tolerant manner? How do they handle slow downstream dependencies? For illustration's sake, consider the following scenario: Service B has 2 hosts, Service C has 2 hosts, and a machine in Service B sends a web request to a machine in Service C.
    18. LinkedIn : Scaling Web Services. What sorts of failure modes are we concerned about? A machine in Service C has a long GC pause, or calls a service that has a long GC pause, or calls a service that calls a service that has a long GC pause… see where I am going? A machine in Service C or in its downstream dependencies may be slow for any reason, not just GC (e.g. bottlenecks on CPU, IO, and memory, or lock contention). Goal: given all of this, how can we ensure high uptime? Hint: pick the right architecture and implement best practices on top of it!
    19. LinkedIn : Scaling Web Services. In the early days, LinkedIn made a big bet on Spring and Spring RPC. Issues: (1) Spring RPC is difficult to debug: you cannot call the service using simple command-line tools like curl, and since the RPC call is implemented as a binary payload over HTTP, HTTP access logs are not very useful. (2) A Spring RPC-based architecture leads to high MTTR: Spring RPC is not flexible and pluggable, so we cannot use custom client-side load-balancing strategies or custom fault-tolerance features. Instead, all we can do is put all of our service nodes behind a hardware load balancer and pray! If a Service C node experiences a slowness issue, a NOC engineer needs to be alerted and then manually remove it from the LB (MTTR > 30 minutes).
    20. LinkedIn : Scaling Web Services. Solution: a better approach is one that we see often in both cloud-based architectures and NoSQL systems: dynamic discovery + client-side load balancing. Step 1: Service C nodes announce their availability to serve traffic to a ZK (ZooKeeper) registry. Step 2: Service B nodes get updates from ZK. Step 3: Service B nodes route traffic to Service C nodes. (A minimal sketch of this pattern follows below.)
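A minimal sketch of the registry pattern described on this slide, not LinkedIn's actual dynamic-discovery code: service nodes register ephemeral znodes under an assumed /services/<name> path, and clients list that path and pick a live node at random (simple client-side load balancing).

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.List;
import java.util.Random;

public class DynamicDiscovery {
    private final ZooKeeper zk;
    private final Random random = new Random();

    public DynamicDiscovery(String zkConnect) throws Exception {
        // 10s session timeout; watcher ignored for brevity.
        this.zk = new ZooKeeper(zkConnect, 10_000, event -> {});
    }

    /** Called by a Service C node on startup: the ephemeral node disappears if the JVM dies. */
    public void announce(String serviceName, String hostPort) throws Exception {
        String base = "/services/" + serviceName;
        if (zk.exists(base, false) == null) {
            zk.create(base, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create(base + "/node-", hostPort.getBytes(),
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    /** Called by a Service B client: fetch the live nodes and pick one. */
    public String pickNode(String serviceName) throws Exception {
        List<String> children = zk.getChildren("/services/" + serviceName, true); // re-watch
        if (children.isEmpty()) {
            return null;                       // no healthy Service C nodes registered
        }
        String chosen = children.get(random.nextInt(children.size()));
        return new String(zk.getData("/services/" + serviceName + "/" + chosen, false, null));
    }
}
```

Because the registration node is ephemeral, a crashed or partitioned Service C host drops out of the registry automatically, which is what removes the "page a NOC engineer to pull it from the LB" step described on the previous slide.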
    21. LinkedIn : Scaling Web Services. With this new paradigm for discovering services and routing requests to them, we can incorporate additional fault-tolerance features.
    22. LinkedIn : Scaling Web Services. Best practices, fault-tolerance support (1): No client should wait indefinitely for a response from a service. Issues: waiting causes a traffic jam, since all upstream clients end up also getting blocked; each service has a fixed number of Jetty or Tomcat threads, and once those are all tied up waiting, no new requests can be handled. Solution: after a configurable timeout, return. Store a different SLA in ZK for each REST endpoint; in other words, all calls are not the same and should not have the same read timeout. (A sketch of per-endpoint timeouts follows below.)
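A sketch of per-endpoint read timeouts, under the assumptions that endpoint names and SLA values are illustrative and that a plain Map stands in for the ZooKeeper-backed SLA store the slide mentions.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

public class TimeoutAwareClient {
    // endpoint -> read timeout in ms; in production this would be refreshed from ZK.
    private final Map<String, Integer> slaMillis = Map.of(
            "/profile", 100,   // cheap lookup: tight SLA
            "/search", 500);   // fan-out call: looser SLA

    public String call(String host, String endpoint) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://" + host + endpoint).openConnection();
        conn.setConnectTimeout(50);
        // Never wait indefinitely: use the endpoint-specific SLA, with a default.
        conn.setReadTimeout(slaMillis.getOrDefault(endpoint, 200));
        try (var in = conn.getInputStream()) {
            return new String(in.readAllBytes());
        }
        // A SocketTimeoutException here should be handled by failing fast or falling back.
    }
}
```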
    23. LinkedIn : Scaling Web Services. Best practices, fault-tolerance support (2): Isolate calls to back-ends from one another. Issues: you depend on responses from independent services A and B; if A slows down, will you still be able to serve B? Details: this is a common use case for federated services and for shard aggregators. E.g. Search at LinkedIn is federated and will call people-search, job-search, group-search, etc. in parallel; people-search is itself sharded, so an additional shard-aggregation step needs to happen across 100s of shards. Solution: use async requests, or independent ExecutorServices for sync requests (one per shard or vertical). (See the sketch below.)
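A sketch of the bulkhead idea with one ExecutorService per downstream vertical, so a slow "people" backend cannot exhaust the threads used for "jobs". The vertical names, pool sizes, and timeout budgets are assumptions for illustration.

```java
import java.util.concurrent.*;

public class FederatedSearch {
    private final ExecutorService peoplePool = Executors.newFixedThreadPool(8);
    private final ExecutorService jobsPool   = Executors.newFixedThreadPool(8);

    public String search(String query) throws InterruptedException {
        Future<String> people = peoplePool.submit(() -> callPeopleSearch(query));
        Future<String> jobs   = jobsPool.submit(() -> callJobSearch(query));

        // Each vertical gets its own budget; a timeout on one does not block the other pool.
        String peopleResult = getOrEmpty(people, 300);
        String jobsResult   = getOrEmpty(jobs, 300);
        return peopleResult + "\n" + jobsResult;
    }

    private String getOrEmpty(Future<String> f, long timeoutMs) throws InterruptedException {
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException | TimeoutException e) {
            f.cancel(true);      // give up on the slow vertical
            return "";           // degrade gracefully instead of failing the whole page
        }
    }

    // Placeholder backend calls (assumptions, not real LinkedIn APIs).
    private String callPeopleSearch(String q) { return "people results for " + q; }
    private String callJobSearch(String q)    { return "job results for " + q; }
}
```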
    24. LinkedIn : Scaling Web Services. Best practices, fault-tolerance support (3): Cancel unnecessary work. Issues: work issued down the call graph is unnecessary if the clients at the top of the call graph have already timed out. Imagine that as a call reaches halfway down your call tree, the caller at the root times out; you will still issue work down the remaining half-depth of your tree unless you cancel it! Solution, a possible approach: the root of the call tree adds (<tree-UUID>, inProgress status) to memcached; all services pass the tree-UUID down the call tree (e.g. as a custom HTTP request header); servlet filters at each hop check whether inProgress == false, and if so, immediately respond with an empty response. (A sketch of such a filter follows below.)
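A hypothetical servlet-filter sketch of the "cancel unnecessary work" approach above. The MemcachedClient interface stands in for a real client such as spymemcached, and the header name X-Call-Tree-UUID and key prefix are assumptions, not anything the talk specifies.

```java
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class CancelledCallFilter implements Filter {
    interface MemcachedClient { Object get(String key); }   // stand-in for a real client

    private final MemcachedClient memcached;

    public CancelledCallFilter(MemcachedClient memcached) { this.memcached = memcached; }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        String treeUuid = httpReq.getHeader("X-Call-Tree-UUID");

        if (treeUuid != null) {
            Object status = memcached.get("calltree:" + treeUuid);
            if (!"inProgress".equals(status)) {
                // The root already timed out: skip the work and return an empty response.
                ((HttpServletResponse) res).setStatus(HttpServletResponse.SC_NO_CONTENT);
                return;
            }
        }
        chain.doFilter(req, res);   // otherwise continue down the call tree
    }

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}
}
```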
    25. LinkedIn : Scaling Web Services. Best practices, fault-tolerance support (4): Avoid sending requests to hosts that are GCing. Issues: if a client sends a web request to a host in Service C and that host is experiencing a GC pause, the client will wait 50-200 ms, depending on the read timeout for the request; during that GC pause, other requests will also be sent to that node before they all eventually time out. Solution: send a "GC scout" request before every "real" web request.
    26. LinkedIn : Scaling Web Services. Why is this a good idea? Scout requests are cheap and add negligible overhead. Step 1: a Service B node sends a cheap 1 msec TCP request to a dedicated "scout" Netty port. Step 2: if the scout request comes back within 1 msec, send the real request to the Tomcat or Jetty port. Step 3: else repeat with a different host in Service C. (A sketch follows below.)
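A sketch of the scout check, assuming a dedicated scout port (7001 here is made up) that echoes a byte back. Because the echo is served by an application thread, a stop-the-world GC pause delays the reply past the 1 ms budget and the client moves on to another host.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class GcScoutRouter {
    private static final int SCOUT_PORT = 7001;      // assumed dedicated Netty scout port
    private static final int SCOUT_TIMEOUT_MS = 1;

    /** Returns the first host whose scout port answers within the budget, or null. */
    public String pickHealthyHost(List<String> hosts) {
        for (String host : hosts) {
            if (scoutOk(host)) {
                return host;   // send the real request to this host's Tomcat/Jetty port
            }
        }
        return null;           // all candidates look paused; the caller should back off
    }

    private boolean scoutOk(String host) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, SCOUT_PORT), SCOUT_TIMEOUT_MS);
            socket.setSoTimeout(SCOUT_TIMEOUT_MS);
            socket.getOutputStream().write('?');          // tiny application-level ping
            return socket.getInputStream().read() != -1;  // a GC pause makes this read time out
        } catch (IOException timedOutOrRefused) {
            return false;
        }
    }
}
```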
    27. LinkedIn : Scaling Web Services. Best practices, fault-tolerance support (5): Services should protect themselves from traffic bursts. Issues: service nodes should protect themselves from being overwhelmed by requests, which also protects their downstream servers from being overwhelmed. Simply setting the Tomcat or Jetty thread pool size is not always an option; oftentimes, these are not configurable per application. Solution: use a sliding-window counter; if the counter exceeds a configured threshold, return immediately with a 503 ("service unavailable"). Set the threshold below the Tomcat or Jetty thread pool size. (A sketch follows below.)
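A minimal sketch of the sliding-window counter described on this slide: count requests in the last windowMs and shed load with an immediate 503 once the count exceeds the threshold, instead of queueing on the container's thread pool.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowLimiter {
    private final long windowMs;
    private final int threshold;              // keep below the Tomcat/Jetty thread pool size
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowLimiter(long windowMs, int threshold) {
        this.windowMs = windowMs;
        this.threshold = threshold;
    }

    /** Returns true if the request may proceed, false if the caller should reply 503. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Drop timestamps that have slid out of the window.
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() > windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= threshold) {
            return false;                      // over budget: shed this request
        }
        timestamps.addLast(now);
        return true;
    }
}

// In a servlet filter: if (!limiter.tryAcquire()) resp.sendError(503, "service unavailable");
```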
    28. Espresso : Scaling Databases (LinkedIn : Databases)
    29. Espresso : Overview. Problem: what do we do when we run out of QPS capacity on an Oracle database server? You can only buy yourself out of this problem so far (i.e. buy a bigger box). Read replicas and memcached will help scale reads, but not writes! Solution: Espresso. You need a horizontally scalable database! Espresso is LinkedIn's newest NoSQL store. It offers the following features: horizontal scalability; works on commodity hardware; document-centric (Avro documents supporting rich nested data models, drama-free schema evolution); extensions for Lucene indexing; supports transactions (within a partition, e.g. memberId); supports conditional reads & writes using standard HTTP headers (e.g. If-Modified-Since). (A conditional-read sketch follows below.)
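A sketch of a conditional read against a document store's REST interface using the standard If-Modified-Since header, as the slide says Espresso supports. The host, port, and URL path are illustrative assumptions, not Espresso's actual API.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ConditionalRead {
    public static void main(String[] args) throws Exception {
        // Assumed path: <db>/<table>/<key>; not Espresso's documented URL scheme.
        URL url = new URL("http://espresso-host:12345/MemberProfiles/Profile/12345");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Only transfer the document if it changed since our cached copy.
        String lastFetched = DateTimeFormatter.RFC_1123_DATE_TIME
                .format(ZonedDateTime.now().minusHours(1));
        conn.setRequestProperty("If-Modified-Since", lastFetched);

        int code = conn.getResponseCode();
        if (code == HttpURLConnection.HTTP_NOT_MODIFIED) {       // 304: reuse the local cache
            System.out.println("Document unchanged; using cached copy");
        } else if (code == HttpURLConnection.HTTP_OK) {          // 200: read the new version
            System.out.println(new String(conn.getInputStream().readAllBytes()));
        }
    }
}
```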
    30. Espresso : Overview. Why not use open source? Change capture stream (e.g. Databus), backup-restore, and a mature storage engine (InnoDB).
    31. Espresso : Architecture. Components: Request Routing Tier (consults the Cluster Manager to discover the node to route to, and forwards the request to the appropriate storage node); Storage Tier (data store: MySQL; local secondary index: Lucene); Cluster Manager (responsible for data-set partitioning; manages storage nodes); Relay Tier (replicates data to consumers).
    32. Databus : Scaling Databases (LinkedIn : Database Streams)
    33. DataBus : Overview. Problem: our databases (Oracle & Espresso) are used for R/W web-site traffic. However, various services (Search, Graph DB, Standardization, etc.) need the ability to read the data as it is changed in these OLTP stores and, occasionally, to scan the contents in order to rebuild their entire state. Solution: Databus. Databus provides a consistent, in-time-order stream of database changes that scales horizontally and protects the source database from high read load.
    34. Where Does LinkedIn Use DataBus?
    35. DataBus : Usage @ LinkedIn. [Diagram: Oracle or Espresso → data change events → search index, graph index, read replicas, standardization.] A user updates the company, title, & school on his profile. He also accepts a connection. The write is made to an Oracle or Espresso master and DataBus replicates: the profile change is applied to the Standardization service (e.g. the many forms of IBM were canonicalized for search-friendliness and recommendation-friendliness); the profile change is applied to the Search Index service (recruiters can find you immediately by new keywords); the connection change is applied to the Graph Index service (the user can now start receiving feed updates from his new connections immediately).
    36. DataBus : Architecture. [Diagram: DB → capture changes → Relay (event window) → on-line changes → Bootstrap.] DataBus consists of 2 services. Relay Service: sharded; maintains an in-memory buffer per shard; each shard polls Oracle and then deserializes transactions into Avro. Bootstrap Service: picks up online changes as they appear in the Relay; supports 2 types of operations from clients: if a client falls behind and needs records older than what the relay has, Bootstrap can send consolidated deltas; if a new client comes on line or an existing client fell too far behind, Bootstrap can send a consistent snapshot.
    37. DataBus : Architecture. [Diagram: DB → Relay → on-line changes → Databus client library → consumers 1..n; Bootstrap delivers a consistent snapshot at time U.] Guarantees: transactions; in-commit-order delivery (commits are replicated in order); durability (you can replay the change stream at any time in the future); reliability (0% data loss); low latency (if your consumers can keep up with the relay, sub-second response time).
    38. DataBus : Architecture. Cool features: server-side (i.e. relay-side & bootstrap-side) filters. Problem: say that your consuming service is sharded 100 ways, e.g. member search indexes sharded by member_id % 100 (index_0, index_1, …, index_99), but you have a single member Databus stream. How do you avoid having every shard read data it is not interested in? Solution: easy, Databus already understands the notion of server-side filters; it will only send updates to your consumer instance for the shard it is interested in. (A consumer-side sketch of the sharding idea follows below.)
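A hypothetical sketch, not the real Databus client API, illustrating the sharding math behind server-side filters: each consumer instance owns one shard (member_id % 100), and the predicate it registers is what the relay/bootstrap would evaluate on its behalf so unrelated events never cross the wire.

```java
public class ShardedIndexConsumer {
    private static final int NUM_SHARDS = 100;
    private final int myShard;                    // e.g. 42 => this instance feeds index_42

    public ShardedIndexConsumer(int myShard) { this.myShard = myShard; }

    /** Predicate the relay/bootstrap would apply server-side on this consumer's behalf. */
    public boolean wantsEvent(long memberId) {
        return memberId % NUM_SHARDS == myShard;
    }

    /** Called for each change event that passes the filter. */
    public void onChangeEvent(long memberId, byte[] avroPayload) {
        // Apply the update to this shard's local search index (index_<myShard>).
        System.out.printf("index_%d: updating member %d (%d bytes)%n",
                myShard, memberId, avroPayload.length);
    }
}
```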
    39. Kafka : Scaling Messaging (LinkedIn : Messaging)
    40. Kafka : Overview. Problem: we have Databus to stream changes that were committed to a database. How do we capture and stream high-volume data if we relax the requirement that the data needs long-term durability? In other words, the data can have limited retention. Challenges: needs to handle a large volume of events; needs to be highly available, scalable, and low-latency; needs to provide limited durability guarantees (e.g. data retained for a week). Solution: Kafka. Kafka is a messaging system that supports topics. Consumers can subscribe to topics and read all data within the retention window. Consumers are then notified of new messages as they appear!
    41. Kafka : Usage @ LinkedIn. Kafka is used at LinkedIn for a variety of business-critical needs. Examples: end-user activity tracking (a.k.a. web tracking): emails opened, logins, pages seen, executed searches, social gestures (likes, sharing, comments); data center operational metrics: network & system metrics such as TCP metrics (connection resets, message resends, etc.) and system metrics (iops, CPU, load average, etc.).
    42. Kafka : Architecture. [Diagram: web tier pushes events (~100 MB/sec) to the broker tier (topics 1..N, sequential writes, sendfile); consumers pull events (~200 MB/sec) via the Kafka client library; Zookeeper handles message-id management and topic/partition ownership.] Features: pub/sub, batch send/receive, E2E compression, system decoupling. Guarantees: at-least-once delivery, very high throughput, low latency (0.8), durability (for a time period), horizontally scalable. Scale at LinkedIn, average unique messages at peak: writes/sec = 460k; reads/sec = 2.3M; # topics = 693; 28 billion unique messages written per day. (A minimal producer/consumer sketch follows below.)
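A minimal publish/subscribe sketch using the modern Kafka Java client, which postdates the 0.8-era API this 2013 talk describes but shows the same model: producers push to a topic, consumers pull everything within the retention window. The topic name and broker address are illustrative assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PageViewPipeline {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Producer pushes an activity-tracking event to a topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-views", "member-123", "/in/someprofile"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "tracking-etl");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");

        // Consumer pulls whatever is still within the topic's retention window.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.key() + " viewed " + r.value());
            }
        }
    }
}
```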
    43. Kafka : Overview. Improvements in 0.8, low-latency features. Kafka has always been designed for high throughput, but E2E latency could have been as high as 30 seconds. Feature 1: long polling. For high-throughput requests, a consumer's request for data will always be fulfilled. For low-throughput requests, a consumer's request will likely return 0 bytes, causing the consumer to back off and wait. What happens if data arrives on the broker in the meantime? As of 0.8, a consumer can "park" a request on the broker for up to m milliseconds; if data arrives during this period, it is instantly returned to the consumer. (A config sketch follows below.)
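A sketch of the long-poll knobs on the consumer fetch path, using the modern client's property names rather than the 0.8-era equivalents the talk refers to: the broker parks the fetch until either fetch.min.bytes of data is available or fetch.max.wait.ms elapses, so new messages come back almost immediately instead of on the next poll cycle.

```java
import java.util.Properties;

public class LongPollConfig {
    public static Properties lowLatencyConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "low-latency-readers");
        props.put("fetch.min.bytes", "1");        // return as soon as any data exists
        props.put("fetch.max.wait.ms", "500");    // otherwise park the request up to 500 ms
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```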
    44. Kafka : Overview. Improvements in 0.8, low-latency features. In the past, data was not visible to a consumer until it was flushed to disk on the broker. Feature 2: new commit protocol. In 0.8, replicas and a new commit protocol have been introduced. As long as data has been replicated to the memory of all replicas, even if it has not been flushed to disk on any one of them, it is considered "committed" and becomes visible to consumers.
    45. Acknowledgments: Jay Kreps (Kafka), Neha Narkhede (Kafka), Kishore Gopalakrishna (Helix), Bob Shulman (Espresso), Cuong Tran (Performance & Scalability), Diego "Mono" Buthay (Search Infrastructure).
    46. Questions?
