Data Applications and Infrastructure at LinkedIn

Hadoop Summit 2010, application track
Jay Kreps, LinkedIn
Speaker notes:
  • Why LinkedIn cares about derived data, and why it is hard
  • Talk about what you can do
  • If you get bad results, I claim you ran an unsuccessful test! Still a small percentage of the quadrillion possible relationships (pairwise is hard)
  • What we learned
  • Azkaban is a workflow scheduler; what is a workflow?
  • Samurai rule: logic lives in the jobs, not the job descriptor; jobs are independent. Work: visualization, polish
  • Transcript

    • 1. Data Applications and Infrastructure at LinkedIn (Jay Kreps, LinkedIn)
    • 2. Plan
      • `whoami`
      • Data products
      • Data infrastructure
    • 3. Data-centric engineering at LinkedIn
      • LinkedIn's Search, Network & Analytics team
      • Domain: derived data
      • Products
        • Search
        • People You May Know
        • Social graph services
        • Job matching
        • Collaborative filtering
      • Infrastructure
    • 4. People You May Know
    • 5. Other products
    • 6. People You May Know
      • 120 billion relationships scored... every day
      • 82 Hadoop jobs (not counting ETL)
      • Around 16 TB of intermediate data
      • A machine learning model to predict the probability of a connection
      • Bloom filters for approximate filtering joins (10x performance improvement; see the sketch below)
      • About 5 test algorithms per week
      • 2 engineers
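      The filtering join above is worth unpacking: per the speaker notes, the pairwise candidate space is on the order of a quadrillion relationships, so dropping non-joinable records before the shuffle is where the 10x comes from. A minimal sketch of the idea using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter follows; the file name, record layout, and class names are illustrative, not LinkedIn's actual code.

        import java.io.DataInputStream;
        import java.io.FileInputStream;
        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.util.bloom.BloomFilter;
        import org.apache.hadoop.util.bloom.Key;

        public class BloomJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
          private final BloomFilter smallSideKeys = new BloomFilter();

          @Override
          protected void setup(Context context) throws IOException {
            // A previous job serialized a filter over the smaller side's join
            // keys; in practice it would arrive via the DistributedCache.
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream("smallside.bloom"))) {
              smallSideKeys.readFields(in);
            }
          }

          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            // Input record: "memberId<TAB>candidateId". Emit only records whose
            // join key might match; false positives are re-checked in the
            // reducer, and false negatives cannot occur.
            String[] fields = line.toString().split("\t", 2);
            if (smallSideKeys.membershipTest(new Key(fields[0].getBytes()))) {
              context.write(new Text(fields[0]), new Text(fields[1]));
            }
          }
        }

      The win is shuffle volume: most of the 120 billion candidate edges never reach a reducer.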
    • 7. Relevance Products
      • You must fly entirely by the instruments
      • Scale and relevance are very closely linked
        • More is often better
        • Iteration time is essential
      • UI matters, really
      • We threw out custom non-Hadoop code that was faster
      • Opportunity to work directly on the business
    • 8. Infrastructure as an Ecosystem
      • An isolated infrastructure team is usually a bad solution
        • Too isolated from the problems
      • The data product team has crushing problems
        • This area is extremely immature
      • People should want to use it
      • Treat it like a product
        • Either make money off it or give it away
        • Open source is a great solution
        • Custom software should be the best
    • 9. Open Source
      • Zoie – real-time search indexing
      • Bobo – faceted search
      • Decomposer – very large matrix decomposition routines (now in Mahout)
      • Norbert – partition-aware cluster management & RPC
      • Voldemort – key/value storage
      • Kamikaze – compression package
      • Sensei – distributed search
      • Azkaban – Hadoop workflow
    • 10. Azkaban: workflow = cron + make
    • 11. Azkaban: workflow : Hadoop :: web framework : webapp
    • 12. Azkaban
    • 13. Azkaban Examples
      • Example job source (see the sketch below)
      • Example workflow UI screenshot
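      The job source on this slide was an image; an Azkaban job descriptor of this era is a plain Java-properties file, roughly like the following (job, class, and jar names are illustrative):

        # score-candidates.job - one node in a PYMK-style flow
        type=command
        command=hadoop jar pymk.jar com.example.pymk.ScoreCandidates
        # Azkaban runs this job only after both dependencies succeed
        dependencies=extract-connections,build-features

      The dependencies line is what turns a pile of cron entries into a restartable graph: Azkaban derives execution order (the "make" half of "cron + make") instead of the author hand-scheduling it.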
    • 14. Workflow
    • 15. Azkaban
      • 82 jobs running every day just for PYMK
        • ...need to run in the right order
        • ...need to restart from failure
        • ...need to enforce dependencies
      • GUI is important for operations
      • Alerting, resource locking, config management, etc.
      • Deployable zip files of code represent a job flow
      • Everyone works independently, releases/deploys independently
      • Simple text files for config (but can use the GUI in a pinch)
      • Aggregated logs and run times
      • Restart from the point of failure
    • 16. Data Deployment
      • How do you get your multi-billion-edge probabilistic relationship graph to the live website to serve queries?
    • 17. Voldemort
      • LinkedIn had many prior passes at this problem, all bad
        • MySQL
        • Oracle
        • Etc.
      • Fully distributed, partitioned, decentralized key-value storage (client usage sketched below)
      • Supports pluggable storage engines
      • Online/offline cycle
      • Is this a good fit?
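      For a sense of what serving queries against such a store looks like, here is a minimal read path using Voldemort's Java client; the bootstrap URL and store name are placeholders:

        import voldemort.client.ClientConfig;
        import voldemort.client.SocketStoreClientFactory;
        import voldemort.client.StoreClient;
        import voldemort.client.StoreClientFactory;
        import voldemort.versioning.Versioned;

        public class CandidateLookup {
          public static void main(String[] args) {
            // Bootstrap from any node; the client discovers the cluster
            // topology and routes each request to the right partition.
            StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666"));
            StoreClient<String, String> store =
                factory.getStoreClient("pymk-candidates");

            // Values come back wrapped in a vector-clock version.
            Versioned<String> result = store.get("member:42");
            System.out.println(result == null ? "no match" : result.getValue());
          }
        }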
    • 18. Voldemort Data Deployment
    • 19. Voldemort Data Deployment
      • Building a multi-TB lookup structure is really, really hard work... it is a batch operation
      • Solution: build this structure in Hadoop
      • Tradeoff: build time vs. lookup time
        • Minimal perfect hashing requires only ~2.5 bits per key, but is slow to build
        • Sorted indexes are a fast, simple alternative (sketched below)
      • The build is a no-op map/reduce (just sorting)
      • The data load will saturate the network even for a small cluster
      • Voldemort gives
        • failover
        • load balancing
        • monitoring
        • remote access
        • partitioning
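      The sorted-index alternative is easy to picture: the Hadoop build sorts fixed-width (hashed key, data offset) entries, and the server answers lookups by binary search over that file. A sketch, with entry widths chosen for illustration:

        import java.io.IOException;
        import java.io.RandomAccessFile;

        public class SortedIndexReader {
          private static final int KEY_BYTES = 8;    // e.g. top 8 bytes of MD5(key)
          private static final int OFFSET_BYTES = 8; // position in the data file
          private static final int ENTRY_BYTES = KEY_BYTES + OFFSET_BYTES;

          // Returns the data-file offset for hashedKey, or -1 if absent.
          static long lookup(RandomAccessFile index, long hashedKey)
              throws IOException {
            long lo = 0;
            long hi = index.length() / ENTRY_BYTES - 1;
            while (lo <= hi) {
              long mid = (lo + hi) >>> 1;
              index.seek(mid * ENTRY_BYTES);
              long candidate = index.readLong();
              if (candidate == hashedKey) return index.readLong();
              if (candidate < hashedKey) lo = mid + 1;
              else hi = mid - 1;
            }
            return -1;
          }
        }

      Each lookup costs O(log n) seeks against a structure a "no-op" sort can produce, versus the near-constant-time but build-heavy minimal perfect hash.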
    • 20. Voldemort Data Deployment
      • If data takes 24 hours to generate, it may take 24 hours to fix
        • Need a faster rollback strategy
      • Cold disk space is cheap
        • Store the live copy
        • Store the copy currently being updated
        • Store N backup copies
        • “Atomic” swap (sketched below)
      • The cache needs to start warm
      • I/O and network throttling to limit the impact of deployment
      • Our production latency is < 3 ms from the client side
      • A 900 GB store takes ~1:30 to build on a 45-node dev cluster
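      The keep-N-copies-and-swap scheme can be sketched as versioned directories behind a "current" pointer: flipping the pointer is the only step readers observe, and rollback is flipping it back to a retained version. The directory layout and names here are illustrative:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;

        public class StoreSwapper {
          // Repoint the "current" symlink at a freshly loaded version directory.
          static void swap(Path storeRoot, long version) throws IOException {
            Path target = storeRoot.resolve("version-" + version);
            Path tmpLink = storeRoot.resolve("current.tmp");
            Files.deleteIfExists(tmpLink);
            Files.createSymbolicLink(tmpLink, target);
            // rename(2) is atomic on POSIX filesystems, so a reader sees either
            // the old store or the new one, never a half-written mix.
            Files.move(tmpLink, storeRoot.resolve("current"),
                StandardCopyOption.ATOMIC_MOVE);
          }

          // Rollback to any retained backup copy is the same cheap operation.
          static void rollback(Path storeRoot, long oldVersion) throws IOException {
            swap(storeRoot, oldVersion);
          }
        }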
    • 21. Questions?
