Data Applications and Infrastructure at LinkedIn (Hadoop Summit 2010)


Hadoop Summit 2010 - application track
Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn


Speaker notes
  • Why LinkedIn cares about derived data, and why it is hard.
  • Talk about what you can do.
  • If you get bad results, I claim you are in an unsuccessful test! Still a small percentage of the quadrillion possible relationships (pairwise is hard).
  • What we learned.
  • Azkaban is a workflow scheduler. What is a workflow?
  • Samurai rule: logic lives in the jobs, not in the job descriptor. Jobs are independent. Remaining work: visualization and polish.

    1. Data Applications and Infrastructure at LinkedIn
       Jay Kreps, LinkedIn
    2. Plan
       • `whoami`
       • Data products
       • Data infrastructure
    3. Data-centric engineering at LinkedIn
       • LinkedIn's Search, Network & Analytics team
       • Domain: derived data
       • Products
         • Search
         • People You May Know
         • Social graph services
         • Job matching
         • Collaborative filtering
       • Infrastructure
    4. People You May Know
    5. Other products
    6. People You May Know
       • 120 billion relationships scored... every day
       • 82 Hadoop jobs (not counting ETL)
       • Around 16 TB of intermediate data
       • Machine learning model to predict the probability of a connection
       • Bloom filters for approximate filtering joins (10x performance improvement; see the sketch below)
       • About 5 test algorithms per week
       • 2 engineers
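
       The deck does not show the Bloom-filter join itself, so here is a minimal sketch of the general technique using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter: a previous job builds a filter over the smaller side's join keys, and each mapper drops large-side records that cannot possibly match before the shuffle. Class names, the HDFS path, and record layout are illustrative, not LinkedIn's actual code.

       import java.io.IOException;
       import java.nio.charset.StandardCharsets;
       import org.apache.hadoop.fs.FSDataInputStream;
       import org.apache.hadoop.fs.FileSystem;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.util.bloom.BloomFilter;
       import org.apache.hadoop.util.bloom.Key;

       public class BloomJoinMapper extends Mapper<Object, Text, Text, Text> {
         private final BloomFilter filter = new BloomFilter();

         @Override
         protected void setup(Context context) throws IOException {
           // Hypothetical path; in practice the serialized filter would be
           // shipped to each mapper via the distributed cache.
           Path path = new Path("/tmp/pymk/candidate-keys.bloom");
           FileSystem fs = FileSystem.get(context.getConfiguration());
           try (FSDataInputStream in = fs.open(path)) {
             filter.readFields(in);
           }
         }

         @Override
         protected void map(Object offset, Text line, Context context)
             throws IOException, InterruptedException {
           String[] fields = line.toString().split("\t", 2);
           if (fields.length < 2) {
             return; // skip malformed records
           }
           // False positives are weeded out by the exact join in the reducer;
           // true negatives (the vast majority) never leave the mapper.
           if (filter.membershipTest(new Key(fields[0].getBytes(StandardCharsets.UTF_8)))) {
             context.write(new Text(fields[0]), new Text(fields[1]));
           }
         }
       }

       Because non-matching records are filtered map-side, far less data crosses the network to the reducers, which is where the claimed 10x improvement would come from.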
    7. Relevance Products
       • You must fly entirely by the instruments
       • Scale and relevance are very closely linked
         • More is often better
         • Iteration time is essential
       • UI matters, really
       • We threw out custom non-Hadoop code that was faster
       • Opportunity to work directly on the business
    8. Infrastructure as an Ecosystem
       • An isolated infrastructure team is usually a bad solution
         • Too isolated from the problems
       • The data product team has crushing problems
         • This area is extremely immature
       • People should want to use it
       • Treat it like a product
         • Either make money off it or give it away
         • Open source is a great solution
         • Custom software should be the best
    9. Open Source
       • Zoie – real-time search indexing
       • Bobo – faceted search
       • Decomposer – very large matrix decomposition routines (now in Mahout)
       • Norbert – partition-aware cluster management & RPC
       • Voldemort – key/value storage
       • Kamikaze – compression package
       • Sensei – distributed search
       • Azkaban – Hadoop workflow
    10. Azkaban: workflow = cron + make
    11. Azkaban: workflow : Hadoop :: web framework : web app
    12. Azkaban
    13. Azkaban Examples
       • Example job source (screenshot)
       • Example workflow UI (screenshot)
    14. Workflow
    15. Azkaban
       • 82 jobs running every day just for PYMK
         • ... need to run in the right order
         • ... need to restart from failure
         • ... need to enforce dependencies
       • GUI is important for operations
       • Alerting, resource locking, config management, etc.
       • Deployable zip files of code represent a job flow
       • Everyone works independently, releases/deploys independently
       • Simple text files for config, but you can use the GUI in a pinch (a sketch of a job file follows this slide)
       • Aggregate logs and run times
       • Restart from the point of failure
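
       The deck does not show one of those "simple text files", but classic Azkaban job files are flat key=value properties, one file per job, with upstream jobs named in a dependencies line. A hypothetical PYMK-style job might look like this (the file name, jar, class, and job names are invented for illustration):

       # score-candidates.job
       type=command
       command=hadoop jar pymk.jar com.example.ScoreCandidates
       dependencies=extract-features,build-bloom-filter

       Azkaban assembles the per-job dependency declarations into a DAG, which is exactly the "cron + make" analogy: cron supplies the scheduling, make supplies the dependency-ordered execution and restart-from-failure.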
    16. Data Deployment
       How do you get your multi-billion-edge probabilistic relationship graph to the live website to serve queries?
    17. Voldemort
       • LinkedIn had many prior passes at this problem, all bad
         • MySQL
         • Oracle
         • Etc.
       • Fully distributed, partitioned, decentralized key-value storage
       • Supports pluggable storage engines
       • Online/offline cycle
       • Is this a good fit? (a client-side sketch follows)
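
       For context on the serving side, reads go through Voldemort's standard client API; this is a minimal sketch using that published API, with a hypothetical bootstrap URL, store name, and key:

       import voldemort.client.ClientConfig;
       import voldemort.client.SocketStoreClientFactory;
       import voldemort.client.StoreClient;
       import voldemort.client.StoreClientFactory;
       import voldemort.versioning.Versioned;

       public class PymkLookup {
         public static void main(String[] args) {
           // Bootstrap URL and store name ("pymk") are hypothetical.
           StoreClientFactory factory = new SocketStoreClientFactory(
               new ClientConfig().setBootstrapUrls("tcp://voldemort.example.com:6666"));
           StoreClient<String, String> client = factory.getStoreClient("pymk");

           // The client handles partitioning, routing, and failover; reads hit
           // whichever store version is currently swapped in as live.
           Versioned<String> suggestions = client.get("member-12345");
           System.out.println(suggestions == null ? "no suggestions" : suggestions.getValue());
         }
       }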
    18. Voldemort Data Deployment
    19. Voldemort Data Deployment
       • Building a multi-TB lookup structure is really, really hard; it is inherently a batch operation
       • Solution: build the structure in Hadoop
       • Tradeoff: build time vs. lookup time
         • Minimal perfect hashing requires only 2.5 bits per key, but is slow to build
         • Sorted indexes are a fast, simple alternative (sketched below)
       • The build is a no-op map/reduce (just sorting)
       • The data load will saturate the network even for a small cluster
       • Voldemort gives you
         • failover
         • load balancing
         • monitoring
         • remote access
         • partitioning
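
       To make the tradeoff concrete, here is a toy sketch (not Voldemort's actual read-only store format) of lookups against a sorted index of fixed-size (key hash, data-file offset) entries. The "build" is nothing more than sorting these entries, which a map/reduce shuffle provides for free, and each lookup is an O(log n) binary search:

       import java.nio.ByteBuffer;

       public class SortedIndex {
         private static final int KEY_BYTES = 16;               // e.g. md5 of the key
         private static final int ENTRY_BYTES = KEY_BYTES + 8;  // hash + data-file offset

         // Entries sorted by key hash. Not thread-safe as written; a real
         // implementation would use positional reads.
         private final ByteBuffer index;

         public SortedIndex(ByteBuffer index) { this.index = index; }

         /** Binary search: returns the data-file offset for keyHash, or -1 if absent. */
         public long lookup(byte[] keyHash) {
           int lo = 0;
           int hi = index.capacity() / ENTRY_BYTES - 1;
           byte[] candidate = new byte[KEY_BYTES];
           while (lo <= hi) {
             int mid = (lo + hi) >>> 1;
             index.position(mid * ENTRY_BYTES);
             index.get(candidate);
             int cmp = compareUnsigned(candidate, keyHash);
             if (cmp == 0) {
               return index.getLong();  // the offset is stored right after the hash
             } else if (cmp < 0) {
               lo = mid + 1;
             } else {
               hi = mid - 1;
             }
           }
           return -1L;
         }

         private static int compareUnsigned(byte[] a, byte[] b) {
           for (int i = 0; i < a.length; i++) {
             int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
             if (diff != 0) return diff;
           }
           return 0;
         }
       }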
    20. Voldemort Data Deployment
       • If data takes 24 hours to generate, it may take 24 hours to fix
         • Need a faster rollback strategy
       • Cold disk space is cheap
         • Store the live copy
         • Store the copy currently being updated
         • Store N backup copies
         • "Atomic" swap (sketched below)
       • Cache needs to start warm
       • I/O and network throttling to limit the impact of deployment
       • Our production latency is < 3 ms from the client side
       • A 900 GB store takes about 90 minutes to build on a 45-node dev cluster
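
       The deck does not show how the swap works; a minimal sketch of one common way to get "atomic" semantics with versioned directories is a symlink rename, shown below. The layout and names are illustrative, not Voldemort's actual mechanism:

       import java.io.IOException;
       import java.nio.file.Files;
       import java.nio.file.Path;
       import java.nio.file.StandardCopyOption;

       public class StoreSwapper {
         /** Point "current" at version-&lt;n&gt;; readers see old or new data, never a mix. */
         public static void swap(Path storeRoot, long newVersion) throws IOException {
           Path target = storeRoot.resolve("version-" + newVersion);
           Path current = storeRoot.resolve("current");
           Path tmp = storeRoot.resolve("current.tmp");

           Files.deleteIfExists(tmp);
           Files.createSymbolicLink(tmp, target);
           // rename(2) over the old link is atomic on POSIX filesystems;
           // rollback is just re-pointing at an older version directory.
           Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE);
         }
       }

       Keeping N older version directories on cold disk is what makes the 24-hour regeneration problem survivable: rollback is a second swap, not a rebuild.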
    21. Questions?