[Chart: LinkedIn member growth, 2003–2015, rising past 32M to 400M members]
Fast forward to today...
LINKEDIN SCALE TODAY
LinkedIn is a global site with over 400 million members.
Web pages and mobile traffic are served at tens of thousands of queries per second.
Backend systems serve millions of queries per second.
LINKEDIN’S ORIGINAL ARCHITECTURE (Circa 2003)
● Huge monolithic app called Leo
● Java, JSP, Servlets, JDBC
● Served every page, same SQL database
So far so good, but two areas to improve:
1. The growing member-to-member connection graph
2. The ability to search those members
MEMBER CONNECTION GRAPH
● Needed to live in-memory for top performance
● Used graph traversal queries not suitable for the shared SQL database
● Different usage profile than other parts of the site
So, a dedicated service was created: LinkedIn’s first service (a traversal sketch follows below).
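To make the traversal idea concrete, here is a minimal, hypothetical sketch of an in-memory adjacency structure with a breadth-first "degrees of separation" query. Class and method names are illustrative, not LinkedIn's actual graph service API:

import java.util.*;

// Hypothetical sketch of an in-memory member connection graph.
// A real service would add sharding, concurrency control, and persistence.
public class MemberGraph {
    // Adjacency sets: memberId -> ids of direct (1st-degree) connections.
    private final Map<Long, Set<Long>> connections = new HashMap<>();

    // Connections are symmetric, so record the edge in both directions.
    public void connect(long a, long b) {
        connections.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        connections.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search: degrees of separation between two members,
    // or -1 if they are not connected within maxDepth hops.
    public int degreesOfSeparation(long from, long to, int maxDepth) {
        if (from == to) return 0;
        Set<Long> visited = new HashSet<>(List.of(from));
        Queue<Long> frontier = new ArrayDeque<>(List.of(from));
        for (int depth = 1; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Queue<Long> next = new ArrayDeque<>();
            for (Long member : frontier) {
                for (Long neighbor : connections.getOrDefault(member, Set.of())) {
                    if (neighbor == to) return depth;
                    if (visited.add(neighbor)) next.add(neighbor);
                }
            }
            frontier = next;
        }
        return -1;
    }
}

A real graph service also shards members across machines and answers shared-connection queries, but the in-memory adjacency set is the core idea.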
Getting better, but the single
database was under heavy load. Vertically scaling helped, but we needed to offload the read traffic...
REPLICA DBs
● Master/slave concept
● Read-only traffic from replicas
● Writes go to main DB (read/write routing sketched below)
● Early version of Databus kept DBs in sync
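As a hedged illustration of that read/write split (hypothetical names, not LinkedIn's actual data-access layer), a simple router can send writes to the master and spread reads across replicas:

import javax.sql.DataSource;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical read/write splitter: writes hit the master, reads are
// spread across replicas. Assumes replication lag is acceptable for reads.
public class ReplicaAwareRouter {
    private final DataSource master;
    private final List<DataSource> replicas;

    public ReplicaAwareRouter(DataSource master, List<DataSource> replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    // All writes must go to the single source of truth.
    public DataSource forWrite() {
        return master;
    }

    // Reads pick a random replica; fall back to master if none exist.
    public DataSource forRead() {
        if (replicas.isEmpty()) return master;
        return replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
    }
}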
REPLICA DBs TAKEAWAYS
● Good medium-term solution
● We could vertically scale servers for a while
● Master DBs have finite scaling limits
● These days, LinkedIn DBs use partitioning
LINKEDIN WITH REPLICA DBs (Circa 2006)
[Diagram: LEO calls the Member Graph service via RPC and sends it connection updates; LEO reads and writes (R/W) the main DB, a Databus relay feeds read-only (R/O) replica DBs, and profile updates flow to Search]
As LinkedIn continued to grow, the monolithic application Leo was becoming problematic. Leo was difficult to release and debug, and the site kept going down...
SERVICE ORIENTED ARCHITECTURE (Circa 2008 on)
Extracting services (Java Spring MVC) from the legacy Leo monolithic application
[Diagram: Public Profile Web App and Recruiter Web App call a Profile Service and yet another service extracted from LEO]
SERVICE ORIENTED ARCHITECTURE
● Goal: create vertical stacks of stateless services
● Frontend servers fetch data from many domains and build the HTML or JSON response (fan-out sketched below)
● Mid-tier services host APIs and business logic
● Data-tier or back-tier services encapsulate data domains
[Example stack: Profile Web App → Profile Service → Profile DB]
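A sketch of that frontend fan-out (all names hypothetical): the frontend calls several mid-tier domain services concurrently, then assembles one response:

import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical frontend-tier aggregation: fan out to several stateless
// mid-tier services, then assemble one JSON-ish response for the client.
public class ProfilePageAssembler {
    // Stand-ins for mid-tier clients; real ones would make RPC/HTTP calls.
    interface ProfileClient { CompletableFuture<String> fetchProfile(long memberId); }
    interface ConnectionsClient { CompletableFuture<Integer> fetchConnectionCount(long memberId); }

    private final ProfileClient profiles;
    private final ConnectionsClient connections;

    public ProfilePageAssembler(ProfileClient profiles, ConnectionsClient connections) {
        this.profiles = profiles;
        this.connections = connections;
    }

    // Fire both mid-tier calls concurrently and join the results.
    public CompletableFuture<Map<String, Object>> assemble(long memberId) {
        CompletableFuture<String> profile = profiles.fetchProfile(memberId);
        CompletableFuture<Integer> count = connections.fetchConnectionCount(memberId);
        return profile.thenCombine(count,
            (p, c) -> Map.of("profile", p, "connectionCount", c));
    }
}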
EXAMPLE MULTI-TIER ARCHITECTURE AT LINKEDIN
[Diagram: Browser / App → Frontend Web App → mid-tier services (Profile, Groups, and Connections content services) → data services (e.g., Edu Data Service) → storage and messaging: DB, Voldemort, Hadoop, Kafka]
SERVICE ORIENTED ARCHITECTURE COMPARISON
PROS
● Stateless services easily scale
● Decoupled domains
● Build and deploy independently
CONS
● Ops overhead
● Introduces backwards compatibility issues
● Leads to complex call graphs and fanout
SERVICES AT LINKEDIN
bash$ eh -e %%prod | awk -F. '{ print $2 }' | sort | uniq | wc -l
756
● In 2003, LinkedIn had one service (Leo)
● By 2010, LinkedIn had over 150 services
● Today in 2015, LinkedIn has over 750 services
CACHING
● Simple way to reduce load on servers and speed up responses
● Mid-tier caches store derived objects from different domains and reduce fanout (look-aside pattern sketched below)
● Caches in the data layer
● We use memcached, Couchbase, even Voldemort
[Diagram: Frontend Web App → Mid-tier Service with cache → DB with cache]
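The pattern behind most of these caches is a look-aside read: check the cache, fall back to the database on a miss, then populate the cache. A minimal sketch follows, with a ConcurrentHashMap standing in for a real store like memcached or Couchbase (all names hypothetical):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical look-aside cache: the service owns the read path and the
// DB stays the source of truth. Real code also needs TTLs and eviction.
public class LookAsideCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database; // fallback loader, e.g. a DB query

    public LookAsideCache(Function<K, V> database) {
        this.database = database;
    }

    public V get(K key) {
        V cached = cache.get(key);
        if (cached != null) return cached;     // cache hit
        V loaded = database.apply(key);        // cache miss: go to the DB
        if (loaded != null) cache.put(key, loaded);
        return loaded;
    }

    // On writes, invalidate so the next read refills from the DB.
    // Invalidation like this is exactly where the complexity piles up.
    public void invalidate(K key) {
        cache.remove(key);
    }
}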
“There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors.”
Via Twitter by Kellan Elliott-McCrea and later Jonathan Feinberg
CACHING TAKEAWAYS
● Caches are easy to add in the beginning, but complexity adds up over time
● Over time, LinkedIn removed many mid-tier caches because of the complexity around invalidation
● We kept caches closer to the data layer
CACHING TAKEAWAYS (cont.)
● Services must handle full load: caches improve speed, they are not a permanent load-bearing solution
● We’ll use a low-latency solution like Voldemort when appropriate and precompute results
LinkedIn’s hypergrowth was extending to
the vast amounts of data it collected. Individual pipelines to route that data weren’t scaling. A better solution was needed...
KAFKA MOTIVATIONS
● LinkedIn generates a ton of data
○ Pageviews
○ Edits on profiles, companies, schools
○ Logging, timing
○ Invites, messaging
○ Tracking
● Billions of events every day
● Separate, independently created pipelines routed this data
A WHOLE LOT OF CUSTOM
PIPELINES... As LinkedIn needed to scale, each pipeline needed to scale.
KAFKA
Distributed pub-sub messaging platform as LinkedIn’s universal data pipeline (a minimal producer sketch follows below)
[Diagram: frontend and backend services publish into Kafka; DWH, Monitoring, Analytics, Hadoop, and Oracle consume from it]
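As a small, hedged illustration of publishing into such a pipeline, here is a minimal producer using the standard Apache Kafka Java client. The broker address, topic name, and payload are made-up assumptions, not LinkedIn's internal tracking setup:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal Kafka producer sketch: one pipeline, many event types, each
// on its own topic. Assumes a broker at localhost:9092 and a
// hypothetical "page-views" topic.
public class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by member id so one member's events stay ordered
            // within a partition.
            producer.send(new ProducerRecord<>("page-views", "member-42",
                "{\"page\":\"/in/someprofile\",\"ts\":1444000000}"));
        }
    }
}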
KAFKA AT LINKEDIN
BENEFITS
● Enabled near-realtime access to any data source
● Empowered Hadoop jobs
● Allowed LinkedIn to build realtime analytics
● Vastly improved site monitoring capability
● Enabled devs to visualize and track call graphs
● Over 1 trillion messages published per day, 10 million messages per second
REST.LI
● Services extracted from Leo or newly created were inconsistent and often tightly coupled
● Rest.li was our move to a data-model-centric architecture
● It ensured a consistent, stateless, RESTful API model across the company
REST.LI (cont.)
● By using JSON over HTTP, our new APIs supported non-Java-based clients
● By using Dynamic Discovery (D2), we got load balancing, discovery, and scalability of each service API
● Today, LinkedIn has 1130+ Rest.li resources and over 100 billion Rest.li calls per day (an illustrative resource sketch follows below)
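For flavor, here is a minimal collection resource following the conventions in Rest.li's public documentation. The Greeting entity (normally generated from a Pegasus schema) and the GreetingsResource are illustrative assumptions, not a real LinkedIn resource:

import com.linkedin.restli.server.annotations.RestLiCollection;
import com.linkedin.restli.server.resources.CollectionResourceTemplate;

// Illustrative Rest.li collection resource: exposes GET /greetings/{id}.
// Greeting is assumed to be generated from a Pegasus (.pdsc) schema,
// which is what keeps the API data-model centric.
@RestLiCollection(name = "greetings", namespace = "com.example.greetings")
public class GreetingsResource extends CollectionResourceTemplate<Long, Greeting> {

    // The framework routes GET requests for a single key here.
    @Override
    public Greeting get(Long id) {
        return new Greeting().setId(id).setMessage("Hello, Rest.li!");
    }
}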
DATA INFRASTRUCTURE
● It was clear LinkedIn could build data infrastructure that enables long-term growth
● LinkedIn doubled down on infra solutions like:
○ Storage solutions
■ Espresso, Voldemort, Ambry (media)
○ Analytics solutions like Pinot
○ Streaming solutions
■ Kafka, Databus, and Samza
○ Cloud solutions like Helix and Nuage
MULTIPLE DATA CENTERS
● Natural progression of horizontal scaling
● Replicate data across many data centers using storage technology like Espresso
● Pin users to a geographically close data center (pinning sketched below)
● Difficult, but necessary
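A hedged sketch of the pinning idea above; every name here is hypothetical, and a real implementation would also persist the assignment and handle failover between data centers:

import java.util.Comparator;
import java.util.List;

// Hypothetical data center pinning: pick the geographically closest
// data center for a member and keep routing them there, so their writes
// land in one master region while Espresso-style replication copies the
// data to the other regions.
public class DataCenterPinner {
    record DataCenter(String name, double lat, double lon) {}

    private final List<DataCenter> dataCenters;

    public DataCenterPinner(List<DataCenter> dataCenters) {
        this.dataCenters = dataCenters;
    }

    // Great-circle distance is overkill for a sketch; squared differences
    // on lat/lon are enough to pick a "closest" region here.
    public DataCenter pin(double memberLat, double memberLon) {
        return dataCenters.stream()
            .min(Comparator.comparingDouble(dc ->
                Math.pow(dc.lat() - memberLat, 2) + Math.pow(dc.lon() - memberLon, 2)))
            .orElseThrow();
    }
}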
MULTIPLE DATA CENTERS (cont.)
● Multiple data centers are imperative to maintain high availability
● You need to avoid any single point of failure, not just for each service but for the entire site
● LinkedIn runs out of three main data centers, with additional PoPs around the globe and more coming online every day...
WHAT ELSE HAVE WE DONE?
● Each of LinkedIn’s critical systems has undergone its own rich history of scale (graph, search, analytics, profile backend, comms, feed)
● LinkedIn uses Hadoop / Voldemort for insights like People You May Know, Similar Profiles, Notable Alumni, and profile browse maps
WHAT ELSE HAVE WE DONE? (cont.)
● Re-architected the frontend approach using:
○ Client templates
○ BigPipe
○ Play Framework
● LinkedIn added multiple tiers of proxies using Apache Traffic Server and HAProxy
● We improved server performance with new hardware, advanced system tuning, and newer Java runtimes
“Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.”
Via Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid
LEARN MORE
● Blog version of this slide deck: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin
● Visual story of LinkedIn’s history: https://ourstory.linkedin.com/
● LinkedIn Engineering blog: https://engineering.linkedin.com
● LinkedIn Open-Source: https://engineering.linkedin.com/open-source
● LinkedIn’s communication system slides, which include the earliest LinkedIn architecture: http://www.slideshare.net/linkedin/linkedins-communication-architecture
● Slides which include the earliest LinkedIn data infra work: http://www.slideshare.net/r39132/linkedin-data-infrastructure-qcon-london-2012
LEARN MORE (cont.)
● Project Inversion, an internal project to enable developer productivity (trunk-based model), faster deploys, and unified services: http://www.bloomberg.com/bw/articles/2013-04-10/inside-operation-inversion-the-code-freeze-that-saved-linkedin
● LinkedIn’s use of Apache Traffic Server: http://www.slideshare.net/thenickberry/reflecting-a-year-after-migrating-to-apache-traffic-server
● Multi data center, testing failovers: https://www.linkedin.com/pulse/armen-hamstra-how-he-broke-linkedin-got-promoted-angel-au-yeung
LEARN MORE - KAFKA
● History and motivation around Kafka: http://www.confluent.io/blog/stream-data-platform-1/
● Thinking about streaming solutions as a commit log: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
● Kafka enabling monitoring and alerting: http://engineering.linkedin.com/52/autometrics-self-service-metrics-collection
● Kafka enabling real-time analytics (Pinot): http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
● Kafka’s current use and future at LinkedIn: http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future
● Kafka processing 1 trillion events per day: https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin
LEARN MORE - DATA INFRASTRUCTURE
● Open sourcing Databus: https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
● Samza streams to help LinkedIn view call graphs: https://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza
● Real-time analytics (Pinot): http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
● Introducing the Espresso data store: http://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store
LEARN MORE - FRONTEND TECH
● LinkedIn’s use of client templates:
○ Dust.js: http://www.slideshare.net/brikis98/dustjs
○ Profile: http://engineering.linkedin.com/profile/engineering-new-linkedin-profile
● BigPipe on LinkedIn’s homepage: http://engineering.linkedin.com/frontend/new-technologies-new-linkedin-home-page
● Play Framework:
○ Introduction at LinkedIn: https://engineering.linkedin.com/play/composable-and-streamable-play-apps
○ Switching to a non-blocking asynchronous model: https://engineering.linkedin.com/play/play-framework-async-io-without-thread-pool-and-callback-hell
LEARN MORE - REST.LI
● Introduction to Rest.li and how it helps LinkedIn scale: http://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale
● How Rest.li expanded across the company: http://engineering.linkedin.com/restli/linkedins-restli-moment
LEARN MORE - SYSTEM TUNING
● JVM memory tuning: http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
● System tuning: http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
● Optimizing JVM tuning automatically: https://engineering.linkedin.com/java/optimizing-java-cms-garbage-collections-its-difficulties-and-using-jtune-solution
WE’RE HIRING
LinkedIn continues to grow quickly and there’s still a ton of work we can do to improve. We’re working on problems that very few ever get to solve. Come join us!