1 content optimization-hug-2010-07-21
Presentation Transcript

  • Online Content Optimization using Hadoop
    • Nitin Motgi
    • [email_address]
  • What is Yahoo? "Listen, Yahoo is a great company that is very, very strong in content for its users … For instance, on our Today module on the front page, every 5 minutes we have 32,000 different variations of that module. So you don't even know what I'm seeing. In fact, we serve a million different front page modules a day, and that's just through content optimization. And that's just the beginning … customized because we know the things you're interested in."
  • What do we do? "Deliver the right CONTENT to the right USER at the right TIME." Keep Carol Bartz excited.
  • Relevance at Yahoo!: content ranges from important (editor-selected) to popular to personal / social; people can curate 10s of items, while science is needed for millions of items
  • Ranking Problems
    • Most Popular: most engaging overall, based on objective metrics
    • Most Popular + Per-User History: engaging overall, and aware of what I’ve already seen
    • Related Items: behavioral affinity (people who did X, did Y)
    • Light Personalization: more relevant to me based on my age, gender and property usage
    • Deep Personalization: most relevant to me based on my deep interests
    • Real-time Dashboard: business optimization
  • Flow: a content feed with business rules enters the optimization engine, which splits traffic into explore (~1%) and exploit (~99%), serves an optimized module, and closes the loop with real-time feedback, content metadata, a rules engine, and a real-time insights dashboard (a sketch of the explore/exploit split follows)
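
The ~1% explore / ~99% exploit split above amounts to diverting a small random fraction of serving traffic to unproven content so the models keep receiving fresh feedback. A minimal sketch of that bucketing decision in Java; the ExploreExploitRouter and ItemScorer names are illustrative assumptions, not the production implementation:

    import java.util.List;
    import java.util.Random;

    // Minimal sketch of an explore/exploit traffic split (names are illustrative).
    public class ExploreExploitRouter {
        private static final double EXPLORE_RATE = 0.01; // ~1% explore, ~99% exploit
        private final Random random = new Random();

        public String pickContent(List<String> candidateIds, ItemScorer scorer) {
            if (random.nextDouble() < EXPLORE_RATE) {
                // Explore: serve a random candidate to gather feedback on unproven items.
                return candidateIds.get(random.nextInt(candidateIds.size()));
            }
            // Exploit: serve the highest-scoring candidate under the current model.
            String best = candidateIds.get(0);
            for (String id : candidateIds) {
                if (scorer.score(id) > scorer.score(best)) {
                    best = id;
                }
            }
            return best;
        }

        /** Placeholder for the model-backed scorer described on the following slides. */
        public interface ItemScorer {
            double score(String itemId);
        }
    }
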
  • How it happens? At time 't', user 'u' (user attributes: age, gender, location) interacted with content 'id' at position 'o' on property/site 'p', section 's', module 'm', international 'i'
    • Content 'id' has associated metadata 'meta' = {entity, keyword, geo, topic, category}
    • User events and item metadata feed modeling, which produces an ITEM model and a USER model; feature generation adds further content and user features
    • Store: PNUTS, with 5 minute latency; modeling runs with 5 to 30 minute latency; ranking with business rules serves requests within a 50 ms to 200 ms SLA
    • Example ITEM model coefficients over features BASE, M, F, ATTR, CAT_Sports: id1 = (0.8, +1.2, -1.5, -0.9, 1.0), id2 = (-0.9, -0.9, +2.6, +0.3, 1.0)
    • Example USER model weights over the same feature space (sparse, as on the slide): u1 = (0.8, 1, 1, 0.2), u2 = (-0.9, 1, -1.2); a scoring sketch follows
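
Since both the item and the user side are expressed as weights over the same feature space, scoring a candidate for a user reduces to a sparse dot product. A hedged sketch, assuming the models are exposed as feature-to-weight maps (that map representation is an assumption for illustration, not how the serving stores expose them):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: score = dot product of user feature weights and item feature coefficients.
    public class DotProductScorer {

        public static double score(Map<String, Double> userWeights,
                                    Map<String, Double> itemCoefficients) {
            double score = 0.0;
            // Iterate over the sparse user vector; missing features contribute 0.
            for (Map.Entry<String, Double> e : userWeights.entrySet()) {
                Double coeff = itemCoefficients.get(e.getKey());
                if (coeff != null) {
                    score += e.getValue() * coeff;
                }
            }
            return score;
        }

        public static void main(String[] args) {
            // Item id1 uses the coefficients from the slide; the user values are illustrative,
            // since the exact column alignment of the user row is not recoverable.
            Map<String, Double> id1 = new HashMap<>();
            id1.put("BASE", 0.8); id1.put("M", 1.2); id1.put("F", -1.5);
            id1.put("ATTR", -0.9); id1.put("CAT_Sports", 1.0);

            Map<String, Double> u1 = new HashMap<>();
            u1.put("BASE", 0.8); u1.put("M", 1.0); u1.put("CAT_Sports", 0.2);

            System.out.println(score(u1, id1)); // higher score = better match for u1
        }
    }
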
  • Models
    • USER x CONTENT FEATURES
      • USER MODEL: tracks user interest in terms of content features
    • ITEM x USER FEATURES
      • ITEM MODEL: tracks the behavior of an item across user features
    • USER FEATURES x CONTENT FEATURES
      • PRIORS: tracks interactions of user features with content features
    • USER x USER
      • CLUSTERING: looks at user-user affinity based on the feature vectors
    • ITEM x ITEM
      • CLUSTERING: looks at item-item affinity based on item feature vectors (see the affinity sketch after this list)
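
For the ITEM x ITEM case above, affinity between two items can be measured as the cosine similarity of their feature vectors. A minimal sketch, reusing the sparse feature-to-weight map representation assumed earlier; this is illustrative, not the production clustering code:

    import java.util.Map;

    // Sketch: item-item affinity as cosine similarity over sparse feature vectors.
    public class ItemAffinity {

        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                normA += e.getValue() * e.getValue();
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other;
                }
            }
            for (double v : b.values()) {
                normB += v * v;
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0;
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }
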
  • Scale
    • Millions of events per second
    • Hundreds of GB per run
    • Millions of stories in the pool
    • Tens of thousands of features (content and/or user)
  • Technology Stack (diagram; the analytics and debugging layer is covered below)
  • Modeling Framework
    • Global state provided by HBase
    • A collection of PIG UDFs (see the UDF sketch after this list)
    • Flows for the modeling stages are assembled in PIG
      • OLR
      • Clustering
      • Affinity
      • Regression Models
      • Decompositions (Cholesky …)
    • Configuration-based behavioral changes for modeling stages
      • Types of features to generate
      • Types of joins to perform: user / item / feature
    • Input: DFS and/or HBase
    • Output: DFS and/or HBase
    • Standard pattern for updating serving stores
      • <Source, Transformation, Sink>
      • E.g. <HBase table, Function(features), PNUTS table>
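
Since the framework is described as a collection of PIG UDFs assembled into flows, each modeling stage ultimately becomes a Java class Pig can call per record. A minimal sketch of such a UDF; the class name and the transformation logic are assumptions for illustration, only the EvalFunc/Tuple API is standard Pig:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Sketch of a Pig UDF: one modeling stage exposed as a per-record function.
    // After REGISTERing the jar it could be used in a Pig flow, e.g.:
    //   features = FOREACH events GENERATE NormalizeFeature(name, value);
    public class NormalizeFeature extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2
                    || input.get(0) == null || input.get(1) == null) {
                return null;
            }
            String featureName = (String) input.get(0);
            double value = ((Number) input.get(1)).doubleValue();
            // Hypothetical transformation: clip the raw value and emit "name:value".
            double clipped = Math.max(-5.0, Math.min(5.0, value));
            return featureName + ":" + clipped;
        }
    }
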
  • HBase
    • ITEM table (see the lookup sketch after this list)
      • Stores item-related features
      • Stores the ITEM x USER FEATURES model
      • Stores per-item parameters such as view count, click count, and unique user count
      • 10s of millions of items
      • Updated every 5 minutes
    • USER table
      • Stores the USER x CONTENT FEATURES model for each individual user, keyed by a unique ID
      • Stores a summarized user history, which is essential for modeling item decay
      • Millions of profiles
      • Updated every 5 to 30 minutes
    • TERM table
      • Inverts the ITEM table and stores statistics per term
      • Used to find trending features and to provide baselines for user features
      • Millions of terms, with hundreds of parameters tracked
      • Updated every 5 minutes
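
To make the table layout above concrete, here is a hedged sketch of a point lookup of item counters from the ITEM table using the classic HTable client API; the column family "stats" and the qualifier "view_count" are assumed schema details, not documented in the deck:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: point lookup of item statistics from the ITEM table (schema details assumed).
    public class ItemTableReader {

        public static long viewCount(String itemId) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "ITEM");
            try {
                Get get = new Get(Bytes.toBytes(itemId));
                get.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("view_count"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("view_count"));
                return value == null ? 0L : Bytes.toLong(value);
            } finally {
                table.close();
            }
        }
    }
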
  • Grid Edge Services
    • Keep MR jobs lean and mean
    • Allow non-gridifyable components to be deployed and controlled easily
    • Have different scaling characteristics (e.g. memory, CPU)
    • Provide a gateway for accessing external data sources from M/R
    • Map and/or reduce steps interact with edge services through a standard client (see the mapper sketch after this list)
    • Examples
      • Categorization
      • Geo Tagging
      • Feature Transformation
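
The pattern described above is that map or reduce tasks stay thin and delegate heavyweight steps such as categorization to a shared edge service via a client. A hedged sketch of what that could look like in a Hadoop mapper; EdgeServiceClient and its categorize() call are hypothetical stand-ins for the "standard client", not a real API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: a lean mapper that delegates categorization to an external edge service.
    public class CategorizeMapper extends Mapper<LongWritable, Text, Text, Text> {

        private EdgeServiceClient client;

        @Override
        protected void setup(Context context) {
            // One connection per task; the endpoint would come from job configuration.
            client = new EdgeServiceClient(context.getConfiguration().get("edge.service.endpoint"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) {
                return; // skip malformed records
            }
            // The heavy lifting (categorization model) runs in the edge service, not in the task.
            String category = client.categorize(fields[1]);
            context.write(new Text(fields[0]), new Text(category));
        }

        @Override
        protected void cleanup(Context context) {
            client.close();
        }

        /** Hypothetical stand-in for the standard edge-service client (not a real API). */
        static class EdgeServiceClient {
            EdgeServiceClient(String endpoint) { /* connect to the categorization service */ }
            String categorize(String text) { return "uncategorized"; /* placeholder result */ }
            void close() { /* release the connection */ }
        }
    }
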
  • Analytics and Debugging
    • Provides the ability to debug modeling issues in near-real time
    • Run complex queries for analysis
    • Easy-to-use interface
    • PMs, engineers, and researchers use this cluster to get near-real-time insights
    • 10s of modeling-monitoring and reporting queries every 5 minutes
    • We use HIVE (see the query sketch after this list)
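
For the monitoring and reporting queries mentioned above, a hedged sketch of how such a check could be issued against Hive over JDBC; the table and column names are invented for illustration, and the driver class and URL assume a HiveServer endpoint of that era is available:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch: a monitoring query against Hive over JDBC (table/column names are illustrative).
    public class ModelMonitorQuery {

        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // Example check: click-through rate per module for a given day (illustrative schema).
            ResultSet rs = stmt.executeQuery(
                "SELECT module, SUM(clicks) / SUM(views) AS ctr " +
                "FROM model_metrics WHERE dt = '2010-07-21' GROUP BY module");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }
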
  • Data Flow
  • Learnings
    • PIG & HBase has been best combination so far
      • Made it simple to build different kind of science models
      • Point lookup using HBase has proven to be very useful
      • Modeling = Matrices
        • HBase provides a natural way to represent and access them
    • Edge Services
      • Have brought simplicity to the whole stack
      • Management (upgrades, outages) has been easy
    • HIVE has given us a great way to analyze the results
      • PIG was also considered
  • Questions?