
Online Content Optimization with Hadoop (Hadoop Summit 2010)


Hadoop Summit 2010 - Application Track
Online Content Optimization with Hadoop
Amit Phadke, Yahoo!

Published in: Technology

    1. Online Content Optimization using Hadoop
         • Doug Campbell
         • Amit Phadke
         • Albert Shau
         • Nitin Motgi
       Yahoo!
    2. Content Optimization
         • General Problem
             • Across several different display contexts
             • Across several content pools
             • Rank content for each item and potentially for each user.
         • From a pool of stories, pick the “top”
             • Many different notions of “top”
                 • Relevance, popularity, demographic popularity, affinity, per user
                 • Each its own experiment in need of tracking
             • Pools of items in the tens, hundreds, or millions
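The idea that each notion of “top” is its own experiment can be sketched as one scoring function per notion, applied to the same pool. This is only an illustration: the item fields, the two scorers, and the `top_k` helper are assumptions and do not come from the slides.

```python
from typing import Callable, Dict, List

# Hypothetical item records; the field names are illustrative only.
items = [
    {"id": "a", "clicks": 120, "views": 1000, "relevance": 0.40},
    {"id": "b", "clicks": 30, "views": 100, "relevance": 0.70},
    {"id": "c", "clicks": 5, "views": 500, "relevance": 0.90},
]

Scorer = Callable[[dict], float]

# One scoring function per notion of "top", so each ranking can be
# produced and tracked as a separate experiment.
scorers: Dict[str, Scorer] = {
    "popularity": lambda item: item["clicks"] / item["views"],
    "relevance": lambda item: item["relevance"],
}


def top_k(pool: List[dict], scorer: Scorer, k: int) -> List[str]:
    """Return the ids of the k highest-scoring items in the pool."""
    ranked = sorted(pool, key=scorer, reverse=True)
    return [item["id"] for item in ranked[:k]]


# Rank the same pool once per experiment.
rankings = {name: top_k(items, fn, 2) for name, fn in scorers.items()}
```

Keeping the scorers in a dictionary keyed by experiment name makes it easy to add another notion of “top” without touching the ranking code.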
    3. Data Flow
         • Processes hundreds of GBs of data every run
         • Generates model artifacts at intervals ranging from every few minutes to every few hours
         • Builds models for millions of items and users
         • Uses grids of varying sizes
    4. Content Metadata Source
         • Metadata used for extracting item features
         • Stored in HBase
    5. Event Data Source
         • Processed events from user activity on the site
         • Normalized and loaded onto HDFS
    6. HBase
         • Maintains models for millions of items and users
         • Efficient joins of a small set of users with events against a large set of models
         • Random lookups/writes to update models independently
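A rough sketch of why keyed random access helps here: only the models of users who actually have new events are looked up and rewritten, and the rest of the large model set is never scanned. The dictionary below is an in-memory stand-in for the model table, not the HBase API, and the update rule, field names, and `apply_events` helper are assumptions made for illustration.

```python
# In-memory stand-in for a large keyed model table (not the HBase API).
models = {
    "user1": {"sports": 0.2},
    "user2": {"news": 0.5},
}

# Hypothetical small batch of events: (user, topic, signal).
events = [
    ("user1", "sports", 1.0),
    ("user1", "news", 1.0),
    ("user3", "sports", 1.0),
]


def apply_events(models, events, rate=0.1):
    """Update only the models whose users appear in the event batch."""
    touched = set()
    for user, topic, signal in events:
        model = models.setdefault(user, {})  # keyed lookup, no full scan
        old = model.get(topic, 0.0)
        # Assumed update rule: move the weight a fraction of the way
        # toward the observed signal.
        model[topic] = old + rate * (signal - old)
        touched.add(user)
    return touched


touched = apply_events(models, events)
```

Each model is updated independently of the others, so the work scales with the number of users who have events rather than with the size of the model table.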
    7. HBase
         • Fast scans allow for global operations
         • Column families support ease of experimentation
         • State@TimeX is easy to retrieve for any record
         • Out-of-order event processing
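HBase keeps multiple timestamped versions of a cell, which is one plausible basis for the State@TimeX and out-of-order points above; the slides do not say how the authors implemented them. The sketch below imitates that behavior in memory. The `VersionedCell` class and its method names are hypothetical and are not part of any HBase client.

```python
import bisect


class VersionedCell:
    """Toy cell that keeps every written value, ordered by event timestamp."""

    def __init__(self):
        self._timestamps = []  # kept sorted in ascending order
        self._values = []      # values aligned with _timestamps

    def put(self, timestamp, value):
        """Store a value at its event time, regardless of arrival order."""
        i = bisect.bisect_left(self._timestamps, timestamp)
        if i < len(self._timestamps) and self._timestamps[i] == timestamp:
            self._values[i] = value  # same timestamp: overwrite
        else:
            self._timestamps.insert(i, timestamp)
            self._values.insert(i, value)

    def get_as_of(self, timestamp):
        """Return the latest value written at or before the timestamp."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.put(300, "score=0.9")
cell.put(100, "score=0.1")  # arrives late, but is still placed at time 100
cell.put(200, "score=0.5")
state_at_250 = cell.get_as_of(250)
```

Because values are positioned by event time rather than arrival time, a late event lands in the right place in the history without reprocessing the events that arrived before it.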
    8. MapReduce
         • Uses Pig to abstract out MR details
         • Parallelism over HBase (LOAD, STORE)
    9. Job Coordination
         • Orchestrates a workflow by stringing together multiple processing units
         • Schedules and launches jobs at desired intervals
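Stringing processing units into a workflow can be sketched as a dependency graph that is run in topological order. The unit names and the `run_workflow` helper below are hypothetical, the slides do not name the coordination tool that was used, and interval scheduling is left out of this sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each unit maps to the set of units it depends on.
workflow = {
    "load_events": set(),
    "extract_features": {"load_events"},
    "update_models": {"extract_features"},
    "publish": {"update_models"},
}


def run_workflow(workflow, units):
    """Run every unit after all of its dependencies; return the run order."""
    log = []
    for name in TopologicalSorter(workflow).static_order():
        units[name]()  # launch the processing unit
        log.append(name)
    return log


# Placeholder callables stand in for real Pig/MapReduce job launches.
units = {name: (lambda: None) for name in workflow}
order = run_workflow(workflow, units)
```

A scheduler would call `run_workflow` at the desired interval; representing dependencies explicitly lets independent units be reordered or parallelized later without changing the unit definitions.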
    10. Leader Election (ZooKeeper)
         • Leader election for Pig/MR jobs
         • Spreads load across multiple workflow machines
         • Provides redundancy and node-failure tolerance
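A common ZooKeeper leader-election recipe has each candidate create an ephemeral sequential node, with the lowest sequence number acting as leader; when that candidate's session ends, its node disappears and the next lowest takes over. The slides do not say which recipe the authors used, so the following is an in-memory simulation of that idea only. `ElectionRegistry` and the machine names are hypothetical, and none of this is the ZooKeeper API.

```python
class ElectionRegistry:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> machine name

    def join(self, machine):
        """Register a candidate and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = machine
        return seq

    def leave(self, seq):
        """Remove a candidate, as would happen when its session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The live candidate with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


registry = ElectionRegistry()
seq_a = registry.join("workflow-a")
seq_b = registry.join("workflow-b")
first_leader = registry.leader()
registry.leave(seq_a)  # simulate failure of the current leader
second_leader = registry.leader()
```

The failover step is what provides the redundancy mentioned on the slide: the surviving machine becomes leader without any manual intervention.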
    11. Edge Services
         • Things that don’t need to run on Hadoop
         • IP2GEO
         • Item Enrichment
    12. Serving Store
         • Low-latency access to key-value data
         • Replicated to multiple colos
    13. Questions?
         • {dcampbel, aphadke, ashau}@yahoo-inc.com
