Online Content Optimization using Hadoop
Nitin Motgi
[email_address]
1 content optimization-hug-2010-07-21

Published in: Technology
  • Instant My! add from D White
  • This is the agenda slide. There is only one of these in the deck. NOTES: What does "X stories to run" mean? Can we be clearer on that? Also, this should be more of a punch line of what we do. This slide, to me, is very broad and not clear. Following are the things that I would describe: the problem of matching the best content to the interest of a user; scale (millions of content slices, millions of users).

    1. Online Content Optimization using Hadoop
       • Nitin Motgi
       • [email_address]
    2. What is Yahoo? … Listen, Yahoo is a great company that is very, very strong in content for its users, uses amazing … For instance, on our today module on the front page, every 5 minutes we have 32,000 different variations of that module. So you don’t even know what I’m seeing. In fact, we serve a million different front page modules a day, and that’s just through content optimization. And that’s just the beginning … Customized because we know the things you’re interested in …
    3. What do we do?
       • “Deliver right CONTENT to the right USER at the right TIME”
       • Keep Carol Bartz excited
    4. Relevance at Yahoo! [Diagram; labels in extracted order: Important, Editors, Popular, Personal / Social, People, 10s of Items, Science, Millions of Items]
    5. Ranking Problems
       • Most Popular: Most engaging overall, based on objective metrics
       • Related Items: Behavioral Affinity: People who did X, did Y
       • Deep Personalization: Most relevant to me, based on my deep interests
       • Real-time Dashboard
       • Business Optimization
       • Light Personalization: More relevant to me, based on my age, gender and property usage
       • Most Popular + Per User History: Engaging overall, and aware of what I’ve already seen
       • [Diagram node labels: X, Y]
    6. Flow [Diagram; labels in extracted order: Optimization Engine, Content feed with biz rules, Explore ~1%, Exploit ~99%, Real-time Feedback, Content Metadata, Dashboard, Optimized Module, Real-time Insights, Rules Engine]
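
The slide only gives the split (explore ~1%, exploit ~99%) and a real-time feedback loop. One common way to realize such a split is an epsilon-greedy policy; the sketch below assumes that technique, and the names `choose_item`, `record_feedback` and `estimate_ctr` are hypothetical, not taken from the talk.

```python
import random


def choose_item(ctr_estimates, explore_rate=0.01, rng=random):
    """Epsilon-greedy pick: explore with probability explore_rate, else exploit.

    ctr_estimates maps item id -> estimated click-through rate.
    explore_rate=0.01 mirrors the ~1% / ~99% split on the slide.
    """
    if not ctr_estimates:
        raise ValueError("no candidate items")
    if rng.random() < explore_rate:
        # Explore: any candidate, chosen uniformly at random.
        return rng.choice(sorted(ctr_estimates))
    # Exploit: the candidate with the best current estimate.
    return max(ctr_estimates, key=ctr_estimates.get)


def record_feedback(stats, item_id, clicked):
    """Fold one view (and possibly one click) into the (views, clicks) counters."""
    views, clicks = stats.get(item_id, (0, 0))
    stats[item_id] = (views + 1, clicks + int(clicked))


def estimate_ctr(stats, prior_views=1.0, prior_clicks=0.0):
    """Smoothed CTR per item; the prior values are illustrative assumptions."""
    return {
        item: (clicks + prior_clicks) / (views + prior_views)
        for item, (views, clicks) in stats.items()
    }
```

Feedback recorded through `record_feedback` changes the estimates, which in turn changes what the exploit branch serves; the small explore share keeps estimates for new or rarely shown items from going stale.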
    7. How it happens?
       • User Events: At time ‘t’, User ‘u’ (user attr: age, gen, loc) interacted with Content ‘id’ at Position ‘o’, Property/Site ‘p’, Section ‘s’, Module ‘m’, International ‘i’
       • Item Metadata: Content ‘id’ has associated metadata ‘meta’; meta = {entity, keyword, geo, topic, category}
       • Feature Generation: Additional Content & User Feature Generation
       • Modeling: ITEM Model, USER Model
       • STORE: PNUTS
       • Ranking, B-Rules, Request
       • Latency labels (in extracted order): 5 min latency; 5 - 30 min latency; SLA 50 ms - 200 ms
       • Table 1:
           Item   BASE   M      F      ATTR   CAT_Sports
           id 1   0.8    +1.2   -1.5   -0.9   1.0
           id 2   -0.9   -0.9   +2.6   +0.3   1.0
       • Table 2 (header as extracted: Item, BASE, M, F, ATTR, CAT_Sports; some cells are missing, so values are listed in extracted order):
           u 1: 0.8, 1, 1, 0.2
           u 2: -0.9, 1, -1.2
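
To make the event and metadata shapes on this slide concrete, here is a minimal sketch of one user event joined with its item metadata. The field names follow the slide's symbols; the types, the string feature encoding and the `generate_features` helper are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserEvent:
    t: int            # time 't' (epoch seconds assumed)
    user: str         # user 'u'
    age: str          # user attributes: age, gen, loc
    gender: str
    loc: str
    content_id: str   # content 'id'
    position: int     # position 'o'
    prop: str         # property/site 'p'
    section: str      # section 's'
    module: str       # module 'm'
    intl: str         # international 'i'


# Metadata kinds listed on the slide: meta = {entity, keyword, geo, topic, category}
META_KINDS = ("entity", "keyword", "geo", "topic", "category")


def generate_features(event, item_meta):
    """Join one event with its item's metadata and emit string features.

    item_meta maps content id -> {kind: [values]}.
    Returns (user_features, content_features).
    """
    meta = item_meta.get(event.content_id, {})
    user_feats = [f"age={event.age}", f"gen={event.gender}", f"loc={event.loc}"]
    content_feats = [
        f"{kind}={value}" for kind in META_KINDS for value in meta.get(kind, [])
    ]
    return user_feats, content_feats
```

An event for an item tagged with topic `NBA` and category `Sports` would yield the content features `topic=NBA` and `category=Sports`, which later modeling stages can count or weight.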
    8. Models
       • USER x CONTENT FEATURES
           • USER MODEL: Tracks user interest in terms of content features
       • ITEM x USER FEATURES
           • ITEM MODEL: Tracks behavior of an item across user features
       • USER FEATURES x CONTENT FEATURES
           • PRIORS: Tracks interactions of user features with content features
       • USER x USER
           • CLUSTERING: Looks at user-user affinity based on the feature vectors
       • ITEM x ITEM
           • CLUSTERING: Looks at item-item affinity based on item feature vectors
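
The model types listed here can be pictured as sparse matrices keyed by the two sides of each "x". The sketch below uses plain interaction counts and cosine similarity as stand-ins; the talk does not say how the weights are actually estimated, so `new_models`, `update_models` and `cosine` are hypothetical helpers.

```python
import math
from collections import defaultdict


def new_models():
    """Three sparse matrices stored as nested dicts (row -> column -> weight)."""
    return {
        name: defaultdict(lambda: defaultdict(float))
        for name in ("user", "item", "priors")
    }


def update_models(models, user, user_feats, item, content_feats, weight=1.0):
    """Fold one interaction into the three matrices (simple counting, an assumption)."""
    for cf in content_feats:
        models["user"][user][cf] += weight        # USER x CONTENT FEATURES
        for uf in user_feats:
            models["priors"][uf][cf] += weight    # USER FEATURES x CONTENT FEATURES
    for uf in user_feats:
        models["item"][item][uf] += weight        # ITEM x USER FEATURES


def cosine(a, b):
    """Cosine similarity of two sparse vectors, e.g. for USER x USER affinity."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Calling `cosine(models["user"]["u1"], models["user"]["u2"])` compares two users by their content-feature rows; the same function applied to item rows gives an item-item affinity.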
    9. Scale
       • Million events per second
       • Hundreds of GB per run
       • Millions of stories in pool
       • Tens of thousands of features (content and/or user)
    10. Technology Stack [Diagram; recoverable label: Analytics and Debugging]
    11. Modeling Framework
       • Global state provided by HBase
       • A collection of PIG UDFs
       • Flow for modeling or stages assembled in PIG
           • OLR
           • Clustering
           • Affinity
           • Regression Models
           • Decompositions (Cholesky …)
       • Configuration-based behavioral changes for stages of modeling
           • Type of features to be generated
           • Type of joins to perform: User / Item / Feature
       • Input: DFS and/or HBase
       • Output: DFS and/or HBase
       • Standard pattern for updating serving stores
           • <Source, Transformation, Sink>
           • E.g. <HBase Table, Function(Features), PNUTS Table>
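
The <Source, Transformation, Sink> pattern can be sketched in a few lines. In this illustration plain Python dicts stand in for the HBase table and the PNUTS table, and `run_update` and `top_k` are hypothetical names; the real framework assembles these stages in PIG.

```python
from typing import Callable, Dict, Iterable, Tuple

Features = Dict[str, float]
Row = Tuple[str, Features]


def run_update(
    source: Iterable[Row],
    transform: Callable[[Features], Features],
    sink: Callable[[str, Features], None],
) -> int:
    """Read rows from the source, transform each one, write it to the sink.

    Returns the number of rows written.
    """
    written = 0
    for key, features in source:
        sink(key, transform(features))
        written += 1
    return written


def top_k(k: int) -> Callable[[Features], Features]:
    """Example Function(Features): keep the k highest-weighted features."""
    def transform(features: Features) -> Features:
        best = sorted(features.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        return dict(best)
    return transform


# Stand-ins for <HBase Table, Function(Features), PNUTS Table>.
hbase_table = {"u1": {"sports": 0.9, "finance": 0.1, "music": 0.5}}
pnuts_table: Dict[str, Features] = {}
rows_written = run_update(hbase_table.items(), top_k(2), pnuts_table.__setitem__)
```

Keeping the three parts separate means a new serving-store update only needs a new transformation function; the read and write sides stay the same.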
    12. HBase
       • ITEM Table
           • Stores item-related features
           • Stores ITEM x USER FEATURES model
           • Stores parameters about the item, like view count, click count, unique user count
           • 10s of millions of items
           • Updated every 5 minutes
       • USER Model
           • Stores USER x CONTENT FEATURES model for each individual user by either a Unique ID
           • Stores summarized user history: essential for modeling in terms of item decay
           • Millions of profiles
           • Updated every 5 to 30 minutes
       • TERM Model
           • Inverts the Item Table and stores statistics for the terms
           • Used to find the trending features and provide baselines for user features
           • Millions of terms and hundreds of parameters tracked
           • Updated every 5 minutes
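
"Inverts the Item Table" suggests an inverted index from terms to per-term statistics. The sketch below assumes a hypothetical item row layout (`terms`, `views`, `clicks`) and tracks three statistics, where the slide mentions hundreds of parameters.

```python
from collections import defaultdict


def invert_item_table(item_table):
    """Build term -> aggregate statistics from item -> row.

    item_table maps item id -> {"terms": [...], "views": int, "clicks": int};
    this column layout is an assumption for illustration.
    """
    terms = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for row in item_table.values():
        # Count each term once per item, even if it is listed twice.
        for term in set(row["terms"]):
            stats = terms[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(terms)
```

Comparing a term's clicks and views across consecutive 5-minute updates is one way such a table could surface trending features.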
    13. Grid Edge Services
       • Keeps MR jobs lean and mean
       • Provides the ability to control and easily deploy non-gridifiable solutions
       • Have different scaling characteristics (e.g. memory, CPU)
       • Provide a gateway for accessing external data sources in M/R
       • Map and/or Reduce steps interact with Edge Services using a standard client
       • Examples
           • Categorization
           • Geo Tagging
           • Feature Transformation
    14. Analytics and Debugging
       • Provides the ability to debug modeling issues in near-real time
       • Run complex queries for analysis
       • Easy-to-use interface
       • PMs, Engineers and Research use this cluster to get near-real-time insights
       • 10s of modeling, monitoring and reporting queries every 5 minutes
       • We use HIVE
    15. [No text extracted for this slide]
    16. Data Flow
    17. Learnings
       • PIG & HBase have been the best combination so far
           • Made it simple to build different kinds of science models
           • Point lookup using HBase has proven to be very useful
           • Modeling = Matrices
               • HBase provides a natural way to represent and access them
       • Edge Services
           • Have provided simplicity to the whole stack
           • Management (upgrades, outages) has been easy
       • HIVE has provided us a great way of analyzing the results
           • PIG was also considered
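
"Modeling = Matrices" maps naturally onto a row-key-to-columns store: each matrix row is one stored row, each non-zero entry is one column. The in-memory class below is only a stand-in for that layout, not HBase client code. The weights are borrowed from the `id 1` row on slide 7; reading that row as per-user-feature weights, and scoring by dot product, are assumptions.

```python
class SparseMatrix:
    """Row key -> {column -> value}, mimicking a wide-column table layout."""

    def __init__(self):
        self._rows = {}

    def put(self, row, col, value):
        """Set one cell."""
        self._rows.setdefault(row, {})[col] = value

    def get(self, row, col, default=0.0):
        """Point lookup of one cell; missing cells read as the default."""
        return self._rows.get(row, {}).get(col, default)

    def dot(self, row, vector):
        """Dot product of one stored row with a sparse vector."""
        cells = self._rows.get(row, {})
        return sum(value * vector.get(col, 0.0) for col, value in cells.items())


# Hypothetical ITEM model row using the slide 7 numbers for 'id 1'.
item_model = SparseMatrix()
for col, weight in {"BASE": 0.8, "M": 1.2, "F": -1.5}.items():
    item_model.put("id 1", col, weight)

# Score for a user whose active features are BASE and M.
score = item_model.dot("id 1", {"BASE": 1.0, "M": 1.0})
```

Only the cells that exist are stored or touched, which is why a point lookup of one row is enough to score one item against one user.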
    18. Questions?