Data Applications and Infrastructure at LinkedIn
Hadoop Summit 2010, application track
Jay Kreps, LinkedIn

1. Data Applications and Infrastructure at LinkedIn
   Jay Kreps, LinkedIn

2. Plan
   - `whoami`
   - Data products
   - Data infrastructure
3. Data-centric engineering at LinkedIn
   - LinkedIn's Search, Network & Analytics team
   - Domain: derived data
   - Products
     - Search
     - People You May Know
     - Social graph services
     - Job matching
     - Collaborative filtering
   - Infrastructure
4. People You May Know

5. Other products
6. People You May Know
   - 120 billion relationships scored... every day
   - 82 Hadoop jobs (not counting ETL)
   - Around 16 TB of intermediate data
   - Machine learning model to predict the probability of a connection
   - Bloom filters for approximate filtering joins (10x performance improvement; sketched below)
   - ~5 test algorithms per week
   - 2 engineers
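The Bloom-filter join deserves a sketch. The idea is to build a compact membership filter over the smaller join side in an upstream job, ship it to every mapper, and drop candidate records that cannot match before they reach the expensive scoring join. A minimal illustration using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; the input format, filter path, and class names are hypothetical, not LinkedIn's actual code:

    import java.io.DataInputStream;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;

    public class CandidateFilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      private final BloomFilter memberFilter = new BloomFilter();

      @Override
      protected void setup(Context context) throws IOException {
        // Filter built by an upstream job over the smaller join side;
        // the path is illustrative.
        Path filterPath = new Path("/pymk/filters/member-ids.bloom");
        try (DataInputStream in = FileSystem.get(context.getConfiguration()).open(filterPath)) {
          memberFilter.readFields(in); // BloomFilter implements Writable
        }
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Assume tab-separated input with the candidate member ID in column 1.
        String candidateId = line.toString().split("\t")[1];
        // False positives only cost a little wasted work in the real join;
        // false negatives cannot happen, so no candidate pairs are lost.
        if (memberFilter.membershipTest(new Key(candidateId.getBytes()))) {
          context.write(offset, line);
        }
      }
    }

That asymmetry is what makes a Bloom filter safe as a pre-join filter: it can only err in the direction of letting slightly too much through.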
7. Relevance Products
   - You must fly entirely by the instruments
   - Scale and relevance are very closely linked
     - More data is often better
     - Iteration time is essential
   - UI matters, really
   - We threw out custom non-Hadoop code that was faster
   - Opportunity to work directly on the business
8. Infrastructure as an Ecosystem
   - An isolated infrastructure team is usually a bad solution
     - Too isolated from the problems
   - The data product team has crushing problems
     - This area is extremely immature
   - People should want to use it
   - Treat it like a product
     - Either make money off it or give it away
     - Open source is a great solution
     - Custom software should be the best
9. Open Source
   - Zoie – real-time search indexing
   - Bobo – faceted search
   - Decomposer – very large matrix decomposition routines (now in Mahout)
   - Norbert – partition-aware cluster management & RPC
   - Voldemort – key/value storage
   - Kamikaze – compression package
   - Sensei – distributed search
   - Azkaban – Hadoop workflow
10. Azkaban: workflow = cron + make

11. Azkaban: workflow : hadoop :: web framework : webapp

12. Azkaban
13. Azkaban Examples
    [slide shows an example job source and the example workflow UI; a hypothetical reconstruction of the job source follows]
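The job source on the slide is a screenshot; as a stand-in, an Azkaban job file of this era was just a key=value properties file. The job names and command below are hypothetical:

    # score-candidates.job (hypothetical PYMK step)
    type=command
    command=hadoop jar pymk-jobs.jar com.example.pymk.ScoreCandidates
    # Azkaban runs this job only after all of its dependencies succeed
    dependencies=extract-connections,build-features

The dependencies line is what wires individual jobs into the workflow DAG described on the next slides.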
14. Workflow
15. Azkaban
    - 82 jobs running every day just for PYMK
      - ...need to run in the right order
      - ...need to restart from failure
      - ...need to enforce dependencies
    - GUI is important for operations
    - Alerting, resource locking, config management, etc.
    - Deployable zip files of code represent a job flow
    - Everyone works independently, releases/deploys independently
    - Simple text files for config (but can use the GUI in a pinch)
    - Aggregate logs and run times
    - Restart from point of failure
16. Data Deployment
    How do you get your multi-billion-edge probabilistic relationship graph to the live website to serve queries?
17. Voldemort
    - LinkedIn had many prior passes at this problem, all bad
      - MySQL
      - Oracle
      - Etc.
    - Fully distributed, partitioned, decentralized key-value storage (client sketch below)
    - Supports pluggable storage engines
    - Online/offline cycle
    - Is this a good fit?
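For context on how the live site reads this data, a minimal sketch of a lookup through Voldemort's Java client API; the bootstrap URL, store name, and key format are assumptions for illustration:

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class PymkLookup {
      public static void main(String[] args) {
        // Bootstrap from any node in the cluster (URL is illustrative).
        StoreClientFactory factory = new SocketStoreClientFactory(
            new ClientConfig().setBootstrapUrls("tcp://voldemort01.example.com:6666"));
        StoreClient<String, String> client = factory.getStoreClient("pymk-recommendations");

        // The same interface is used whether the underlying engine is a
        // read-write store or a Hadoop-built read-only store.
        Versioned<String> recs = client.get("member:12345");
        if (recs != null) {
          System.out.println(recs.getValue());
        }
      }
    }

Pluggable storage engines are what make the online/offline cycle work: client code like the above does not change when the store behind it is rebuilt offline in Hadoop.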
18. Voldemort Data Deployment
19. Voldemort Data Deployment
    - Building a multi-TB lookup structure is really, really hard, but it is a batch operation
    - Solution: build the structure in Hadoop
    - Tradeoff: build time vs. lookup time
      - Minimal perfect hashing requires only 2.5 bits per key, but is slow to build
      - Sorted indexes are a fast, simple alternative (sketched below)
    - The build is a no-op map/reduce (just sorting)
    - The data load will saturate the network even for a small cluster
    - Voldemort gives
      - failover
      - load balancing
      - monitoring
      - remote access
      - partitioning
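A minimal sketch of the sorted-index idea, assuming a simplified entry layout (16-byte MD5 of the key followed by an 8-byte offset into the data file); this shows the shape of the technique, not Voldemort's exact read-only file format. The "no-op" map/reduce works because the shuffle already sorts by key hash, so the reducer just writes entries out in order; serving is then a binary search over the memory-mapped index:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class SortedIndexStore {
      private static final int HASH_BYTES = 16;   // MD5 of the key
      private static final int OFFSET_BYTES = 8;  // position in the data file
      private static final int ENTRY_BYTES = HASH_BYTES + OFFSET_BYTES;

      private final MappedByteBuffer index; // single-threaded sketch: position() mutates state
      private final int numEntries;

      public SortedIndexStore(String indexFile) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(indexFile, "r");
             FileChannel channel = file.getChannel()) {
          index = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
          numEntries = (int) (channel.size() / ENTRY_BYTES);
        }
      }

      /** Returns the data-file offset for the key, or -1 if the key is absent. */
      public long lookup(byte[] key) throws NoSuchAlgorithmException {
        byte[] target = MessageDigest.getInstance("MD5").digest(key);
        byte[] hash = new byte[HASH_BYTES];
        int lo = 0, hi = numEntries - 1;
        while (lo <= hi) {
          int mid = (lo + hi) >>> 1;
          index.position(mid * ENTRY_BYTES);
          index.get(hash);
          int cmp = compareUnsigned(hash, target);
          if (cmp < 0)      lo = mid + 1;
          else if (cmp > 0) hi = mid - 1;
          else              return index.getLong(); // offset follows the hash
        }
        return -1;
      }

      private static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
          int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
          if (diff != 0) return diff;
        }
        return 0;
      }
    }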
20. Voldemort Data Deployment
    - If data takes 24 hours to generate, it may take 24 hours to fix
      - Need a faster rollback strategy
    - Cold disk space is cheap
      - Store the live copy
      - Store the copy currently being updated
      - Store N backup copies
      - "Atomic" swap (sketched below)
    - The cache needs to start warm
    - Network I/O throttling to limit the impact of deployment
    - Our production latency is < 3 ms from the client side
    - A 900 GB store takes ~1:30 to build on a 45-node dev cluster
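A sketch of the versioned-layout idea behind the "atomic" swap, assuming a version-N directory naming scheme (illustrative, not Voldemort's exact implementation). Each data load lands in a fresh directory; going live, and rolling back, is a single reference flip rather than a 24-hour rebuild:

    import java.io.File;
    import java.util.concurrent.atomic.AtomicReference;

    public class VersionedStoreDirs {
      private final File root; // e.g. /data/stores/pymk (illustrative)
      private final AtomicReference<File> live = new AtomicReference<>();

      public VersionedStoreDirs(File root) {
        this.root = root;
      }

      /** Readers always see one complete, consistent version. */
      public File liveDir() {
        return live.get();
      }

      /** Called only after version-N has been fully fetched and verified. */
      public void swapTo(int version) {
        File next = new File(root, "version-" + version);
        if (!next.isDirectory()) {
          throw new IllegalStateException("incomplete fetch: " + next);
        }
        live.set(next); // the "atomic" swap: one reference flip, no partial state
      }

      /** Rollback is the same flip in reverse: immediate, no rebuild. */
      public void rollbackTo(int version) {
        swapTo(version);
      }
    }

Keeping N cold backup versions on disk is what makes rollback cheap: a bad build is abandoned, not repaired.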
21. Questions?