Hadoop Summit 2010 - application track

Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn

© All Rights Reserved

Speaker notes

  • Why LinkedIn cares about derived data, and why it is hard
  • Talk about what you can do
  • If you get bad results, I claim you are in an unsuccessful test! Still a small percentage of the quadrillion possible relationships (pairwise is hard)
  • What we learned
  • Azkaban is a workflow scheduler? What is a workflow?
  • Samurai rule: logic is in jobs, not the job descriptor; jobs are independent. Work: viz, polish

Presentation Transcript

  • Data Applications and Infrastructure at LinkedIn
    • Jay Kreps
    LinkedIn
  • Plan
        • `whoami`
        • Data products
        • Data infrastructure
  • Data-centric engineering at LinkedIn
        • LinkedIn’s Search, Network & Analytics (SNA) team
        • Domain: Derived data
        • Products
          • Search
          • People you may know
          • Social graph services
          • Job matching
          • Collaborative filtering
        • Infrastructure
  • People You May Know
  • Other products
  • People You May Know
    • 120 billion relationships scored...every day
    • 82 hadoop jobs (not counting ETL)
    • Around 16TB of intermediate data
    • Machine learning model to predict probability of connection
    • Bloom filters for approximate filtering joins (10x perf improvement)
    • ~5 test algorithms per week
    • 2 engineers
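The “Bloom filters for approximate filtering joins” bullet refers to pre-filtering the large side of a join against a compact probabilistic set of the small side’s keys, so far fewer rows reach the expensive join; false positives are caught by the real join afterwards. A minimal Python sketch of the idea (not LinkedIn’s implementation; the hashing scheme, sizes, and names are all illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a fixed-size bit array."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k positions by salting the key; md5 here is just a
        # convenient stand-in for a real hash family.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# Approximate filtering join: only rows whose join key *might* appear on
# the small side survive; members of the small side are never dropped.
small_side = ["alice", "bob"]
bf = BloomFilter()
for k in small_side:
    bf.add(k)

big_side = [("alice", 1), ("carol", 2), ("bob", 3)]
candidates = [row for row in big_side if row[0] in bf]
```

In a MapReduce join the filter would be built from the small side, shipped to every mapper, and applied before the shuffle, which is where the claimed perf win comes from.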
  • Relevance Products
    • You must fly entirely by the instruments
    • Scale and relevance very closely linked
      • More is often better
      • Iteration time is essential
    • UI matters, really
    • We threw out custom non-hadoop code that was faster
    • Opportunity to work directly on the business
  • Infrastructure as an Ecosystem
    • Isolated infrastructure team is usually a bad solution
      • Too isolated from the problems
    • Data product team has crushing problems
      • This area is extremely immature
    • People should want to use it
    • Treat it like a product
      • Either make money off it or give it away
      • Open source is a great solution
      • Custom software should be the best
  • Open Source
    • Zoie – Real-time search indexing
    • Bobo – Faceted search
    • Decomposer – Very large matrix decomposition routines (now in Mahout)
    • Norbert – Partition-aware cluster management & RPC
    • Voldemort – Key/value storage
    • Kamikaze – Compression package
    • Sensei – Distributed search
    • Azkaban – Hadoop workflow
  • Azkaban workflow = cron + make
  • Azkaban workflow : hadoop :: web framework : webapp
  • Azkaban
  • Azkaban Examples
    • Example job source:
    Example workflow UI
  • Workflow
  • Azkaban
    • 82 jobs running every day just for PYMK
      • ...need to run in the right order
      • … need to restart from failure
      • … need to enforce dependencies
    • GUI is important for operations
    • alerting, resource locking, config management, etc
    • deployable zip files of code represent a job flow
    • everyone works independently, releases/deploys independently
    • simple text files for config (but can use GUI in a pinch)
    • aggregate logs, run times
    • restart from point of failure
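The “simple text files for config” bullet maps to per-job property files; a zip of these plus the job code is the deployable flow. A hedged sketch of what two such files might look like (the job names, jar, and class names are illustrative, not from the talk):

```properties
# extract.job — an independent unit of work
type=command
command=hadoop jar pymk.jar com.example.pymk.ExtractJob

# score.job — runs only after "extract" succeeds;
# the dependency lives in the descriptor, the logic in the job
type=command
command=hadoop jar pymk.jar com.example.pymk.ScoreJob
dependencies=extract
```

The scheduler walks the declared dependencies to run jobs in the right order and to restart a failed flow from the failing node rather than from scratch.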
  • Data Deployment
    • How do you get your multi-billion-edge probabilistic relationship graph to the live website to serve queries?
  • Voldemort
    • LinkedIn had many prior passes at this problem, all bad
      • MySQL
      • Oracle
      • Etc.
    • Fully distributed, partitioned, decentralized key-value storage
    • Supports pluggable storage engines
    • Online/offline cycle
    • Is this a good fit?
  • Voldemort Data Deployment
  • Voldemort Data Deployment
    • Building a multi TB lookup structure is really, really hard work...it is a batch operation
    • Solution: build this structure in hadoop
    • Tradeoff: build time vs lookup time
      • Minimal perfect hashing requires only 2.5 bits per key, but is slow to build
      • Sorted indexes are a fast, simple alternative
    • Build is a no-op map/reduce (just sorting)
    • Data load will saturate the network even for a small cluster
    • Voldemort gives
      • failover
      • load balancing
      • monitoring
      • remote access
      • partitioning
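The build-time vs. lookup-time tradeoff above can be seen in miniature: the batch phase does all the sorting once (the “no-op map/reduce”), and serving is just binary search over the sorted keys. A hedged Python sketch using an in-memory stand-in for the on-disk sorted index (keys and values are illustrative):

```python
import bisect

# Batch "build" phase: sort once. In the real system this is the
# sort performed by the map/reduce shuffle, written out as index files.
records = {"bob": b"v2", "alice": b"v1", "carol": b"v3"}
sorted_keys = sorted(records)
values = [records[k] for k in sorted_keys]

def lookup(key):
    """Serving phase: O(log n) binary search over the sorted index."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None
```

A minimal perfect hash would make `lookup` O(1) and the index smaller, at the cost of a much more expensive build, which is the tradeoff the slide calls out.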
  • Voldemort Data Deployment
    • If data takes 24 hours to generate, it may take 24 hours to fix
      • Need a faster rollback strategy
    • Cold disk space is cheap
      • Store the live copy
      • Store the copy currently being updated
      • Store N backup copies
      • “Atomic” swap
    • Cache needs to start warm
    • I/O network throttling to limit impact of deployment
    • Our prod latency is < 3 ms from the client side
    • 900GB store takes ~1:30 to build on a 45-node dev cluster
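One common way to implement the “atomic” swap with cheap rollback described above is versioned store directories plus a `current` symlink that is repointed by an atomic rename. A sketch under that assumption (POSIX rename semantics; the directory layout and names are illustrative, not Voldemort’s actual on-disk format):

```python
import os
import tempfile

# Versioned layout: version-N directories hold full builds;
# "current" is a symlink to the live one. Cold disk space is cheap.
store = tempfile.mkdtemp()
for n in (1, 2):
    d = os.path.join(store, f"version-{n}")
    os.mkdir(d)
    with open(os.path.join(d, "data"), "w") as f:
        f.write(f"build {n}")

current = os.path.join(store, "current")
os.symlink(os.path.join(store, "version-1"), current)

def swap(version):
    """Repoint 'current' via rename of a fresh symlink (atomic on POSIX).
    Rollback is the same operation pointed at an older version."""
    tmp = os.path.join(store, "current.tmp")
    os.symlink(os.path.join(store, f"version-{version}"), tmp)
    os.replace(tmp, current)

swap(2)  # deploy the new build
with open(os.path.join(current, "data")) as f:
    deployed = f.read()
```

Readers always follow `current`, so a bad 24-hour build is undone in seconds by swapping back to a retained copy instead of regenerating the data.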
  • Questions?