Online Content Optimization with Hadoop__HadoopSummit2010
Hadoop Summit 2010 - Application Track

Online Content Optimization with Hadoop
Amit Phadke, Yahoo!



© All Rights Reserved


Online Content Optimization with Hadoop__HadoopSummit2010 Presentation Transcript

  • 1. Online Content Optimization using Hadoop
    • Doug Campbell
    • Amit Phadke
    • Albert Shau
    • Nitin Motgi
  • 2. Content Optimization
    • General Problem
      • Across several different display contexts
      • Across several content pools
      • Rank content for each item and potentially for each user.
    • From a pool of stories, pick the “top”
      • Many different notions of “top”
        • Relevance, popularity, demographic popularity, affinity, per user
        • Each its own experiment in need of tracking
      • Pools of items in the tens, hundreds, or millions
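
The idea of several competing notions of "top" can be sketched as a set of interchangeable scoring functions applied to the same pool. This is only an illustration: the `Story` fields, the click-through-rate popularity score, and the `relevance` value are assumptions made for the example, not details from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Story:
    story_id: str
    clicks: int       # hypothetical popularity signal
    views: int
    relevance: float  # hypothetical relevance score in [0, 1]


def click_through_rate(story: Story) -> float:
    """Popularity approximated as clicks per view (an assumption)."""
    return story.clicks / story.views if story.views else 0.0


# Each notion of "top" is one scorer; each could be tracked as its own experiment.
SCORERS: Dict[str, Callable[[Story], float]] = {
    "popularity": click_through_rate,
    "relevance": lambda s: s.relevance,
}


def top_k(pool: List[Story], scorer: Callable[[Story], float], k: int) -> List[Story]:
    """Return the k highest-scoring stories under the given scorer."""
    return sorted(pool, key=scorer, reverse=True)[:k]


pool = [
    Story("a", clicks=50, views=1000, relevance=0.2),
    Story("b", clicks=30, views=200, relevance=0.9),
    Story("c", clicks=5, views=500, relevance=0.5),
]
rankings = {
    name: [s.story_id for s in top_k(pool, scorer, 2)]
    for name, scorer in SCORERS.items()
}
```

Keeping the scorers in a dictionary keyed by experiment name makes it straightforward to compare rankings side by side for the same pool.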
  • 3. Data Flow
    • Processes hundreds of GBs of data every run
    • Generates model artifacts at intervals from every few minutes to every few hours
    • Builds models for millions of items and users
    • Grids in use of varying sizes
  • 4. Content Metadata Source
    • Metadata used for extracting item features
    • Stored in HBase
  • 5. Event Data Source
    • Processed events from user activity on the site
    • Normalized and loaded onto HDFS
  • 6. HBase
    • Maintains models for millions of items and users
    • Efficient joins of the small set of users with events against the large set of models
    • Random lookups/writes to update models independently
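
A rough sketch of why random lookups and writes help here: only the users who appear in the current batch of events are read and written, so the much larger set of stored models is never scanned. A plain dictionary stands in for the keyed store, and the moving-average update rule and `learning_rate` are assumptions for illustration, not the models described in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def update_models(
    model_store: Dict[str, float],
    events: Iterable[Tuple[str, bool]],
    learning_rate: float = 0.1,
) -> int:
    """Update only the models of users that have events in this batch.

    model_store is a stand-in for a keyed table (user_id -> model value).
    Returns the number of models touched.
    """
    # Group the (small) event batch by user.
    outcomes_by_user = defaultdict(list)
    for user_id, clicked in events:
        outcomes_by_user[user_id].append(clicked)

    for user_id, outcomes in outcomes_by_user.items():
        model = model_store.get(user_id, 0.0)  # point read for one key
        for clicked in outcomes:
            target = 1.0 if clicked else 0.0
            # Hypothetical update: exponential moving average of clicks.
            model += learning_rate * (target - model)
        model_store[user_id] = model  # point write for one key
    return len(outcomes_by_user)


store = {"u1": 0.5, "u2": 0.5, "u3": 0.5}
touched = update_models(store, [("u1", True), ("u4", False)])
```

In this run only `u1` and `u4` are touched; `u2` and `u3` are left exactly as they were.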
  • 7. HBase (continued)
    • Fast scans allow for global operations
    • Column families make experimentation easy
    • State at time X is easy to recover for any record
    • Out-of-order event processing
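
HBase stores cell values with timestamps and can keep several versions of a cell, which is what makes both "state at time X" reads and out-of-order writes natural. The class below is an in-memory stand-in that mimics that behavior; it is not the HBase client API, and the name `VersionedCell` is invented for this sketch.

```python
import bisect
from typing import Any, List, Optional


class VersionedCell:
    """Keeps every written value, ordered by event timestamp."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []
        self._values: List[Any] = []

    def put(self, timestamp: int, value: Any) -> None:
        """Insert a value at its event time, regardless of arrival order."""
        i = bisect.bisect_left(self._timestamps, timestamp)
        if i < len(self._timestamps) and self._timestamps[i] == timestamp:
            self._values[i] = value  # same timestamp: overwrite that version
        else:
            self._timestamps.insert(i, timestamp)
            self._values.insert(i, value)

    def get_as_of(self, timestamp: int) -> Optional[Any]:
        """Return the latest value written at or before the given time."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.put(30, "c")  # arrives first, but is the newest event
cell.put(10, "a")  # late arrival
cell.put(20, "b")  # late arrival
state_at_25 = cell.get_as_of(25)
```

Because versions are ordered by event time rather than arrival time, late events slot into the right place without reprocessing earlier ones.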
  • 8. MapReduce
    • Uses Pig to abstract out MR details
    • Parallelism over HBase (LOAD, STORE)
  • 9. Job co-ordination
    • Orchestrates a workflow by stringing together multiple processing units
    • Schedules and launches jobs at desired intervals
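
Stringing processing units into a workflow can be sketched as running steps in dependency order. The step names below are hypothetical, and the example uses Python's standard `graphlib` module rather than whatever coordinator the talk's system used; launching at desired intervals would sit in a scheduler around `run_workflow`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    steps: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run each step after all of its prerequisites; return the run order.

    dependencies maps a step name to the set of steps it depends on.
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        steps[name]()  # launch the processing unit
        order.append(name)
    return order


log: List[str] = []
steps = {
    "load_events": lambda: log.append("events loaded"),
    "extract_features": lambda: log.append("features extracted"),
    "build_models": lambda: log.append("models built"),
    "publish": lambda: log.append("artifacts published"),
}
dependencies = {
    "extract_features": {"load_events"},
    "build_models": {"extract_features"},
    "publish": {"build_models"},
}
run_order = run_workflow(steps, dependencies)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which gives an early check that the workflow definition is valid.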
  • 10. Leader Election (ZooKeeper)
    • Leader Election for Pig/MR Jobs
    • Spreads load across multiple workflow machines
    • Provides Redundancy and node failure tolerance
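
The usual ZooKeeper leader-election recipe has each candidate create an ephemeral sequential node; the candidate holding the lowest sequence number is the leader, and when its session ends the node disappears and the next candidate takes over. The class below simulates that rule in memory so the failover behavior is visible. It does not talk to ZooKeeper, and `ElectionSim` and the machine names are invented for this sketch.

```python
import itertools
from typing import Dict, Optional


class ElectionSim:
    """In-memory imitation of election via sequential, ephemeral nodes."""

    def __init__(self) -> None:
        self._sequence = itertools.count()
        self._nodes: Dict[int, str] = {}  # sequence number -> machine name

    def join(self, machine: str) -> int:
        """Register a candidate; returns its sequence number."""
        seq = next(self._sequence)
        self._nodes[seq] = machine
        return seq

    def leave(self, seq: int) -> None:
        """Simulate a session ending (crash or shutdown)."""
        self._nodes.pop(seq, None)

    def leader(self) -> Optional[str]:
        """The candidate with the lowest live sequence number leads."""
        return self._nodes[min(self._nodes)] if self._nodes else None


election = ElectionSim()
first = election.join("workflow-1")
second = election.join("workflow-2")
leader_before = election.leader()
election.leave(first)  # the current leader fails
leader_after = election.leader()
```

Running one election per job rather than one for the whole cluster is one way different machines could end up leading different jobs, which would spread the load as the slide describes; that mapping is an assumption here.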
  • 11. Edge Services
    • Things that don’t need to run on Hadoop
    • IP2GEO
    • Item Enrichment
  • 12. Serving Store
    • Low-latency access to key-valued data
    • Replicated to multiple colos (colocation data centers)
  • 13. Questions?
    • {dcampbel, aphadke, ashau}@yahoo-inc.com