Online Content Optimization using Hadoop
Nitin Motgi [email_address]
What is Yahoo?
… Listen, Yahoo is a great company that is very, very strong in content for its users, uses amazing … For instance, on our Today module on the front page, every 5 minutes we have 32,000 different variations of that module. So you don’t even know what I’m seeing. In fact, we serve a million different front page modules a day, and that’s just through content optimization. And that’s just the beginning … Customized because we know the things you’re interested in …
What do we do?
  • “Deliver the right CONTENT to the right USER at the right TIME”
  • Keep Carol Bartz excited
Relevance at Yahoo!
  • Diagram labels: Important, Editors, Popular, Personal / Social
  • People: 10s of Items
  • Science: Millions of Items
Ranking Problems
  • Most Popular: most engaging overall, based on objective metrics
  • Related Items: behavioral affinity (people who did X, did Y)
  • Deep Personalization: most relevant to me, based on my deep interests
  • Light Personalization: more relevant to me, based on my age, gender and property usage
  • Most Popular + Per User History: engaging overall, and aware of what I’ve already seen
  • Real-time Dashboard
  • Business Optimization
Flow
  • Optimization Engine
  • Content feed with biz rules
  • Explore ~1% / Exploit ~99%
  • Real-time Feedback
  • Content Metadata
  • Dashboard
  • Optimized Module
  • Real-time Insights
  • Rules Engine
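The Explore ~1% / Exploit ~99% split on this slide could be realized with an epsilon-greedy policy. The slides do not say which policy the Optimization Engine actually uses, so this is only a minimal sketch under that assumption; `choose_item`, the story ids, and the CTR estimates are all hypothetical.

```python
import random


def choose_item(ctr_estimates, explore_rate=0.01, rng=random):
    """Pick an item id: explore uniformly at random with probability
    explore_rate, otherwise exploit the item with the best estimate."""
    items = list(ctr_estimates)
    if rng.random() < explore_rate:
        return rng.choice(items)                  # explore (~1% of requests)
    return max(items, key=ctr_estimates.get)      # exploit (~99% of requests)


# Hypothetical click-through-rate estimates for three stories.
ctr = {"story_a": 0.031, "story_b": 0.054, "story_c": 0.012}

rng = random.Random(0)  # seeded so the simulation is repeatable
picks = [choose_item(ctr, 0.01, rng) for _ in range(10_000)]
share_best = picks.count("story_b") / len(picks)
```

The small explore share keeps fresh feedback flowing for items whose estimates are still uncertain, while nearly all traffic goes to the current best item.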
How it happens?
  • User Events: at time ‘t’, User ‘u’ (user attr: age, gen, loc) interacted with Content ‘id’ at Position ‘o’, Property/Site ‘p’, Section ‘s’, Module ‘m’, International ‘i’
  • Item Metadata: Content ‘id’ has associated metadata ‘meta’; meta = {entity, keyword, geo, topic, category}
  • Feature Generation: additional content & user feature generation
  • Modeling: ITEM Model, USER Model
  • STORE: PNUTS
  • Ranking, B-Rules, Request; SLA 50 ms – 200 ms
  • Latencies shown on the diagram: 5 min; 5 – 30 min

    Item   BASE   M      F      ATTR   CAT_Sports
    id 1    0.8   +1.2   -1.5   -0.9   1.0
    id 2   -0.9   -0.9   +2.6   +0.3   1.0

    Item   BASE   M   F   ATTR   CAT_Sports
    u 1:  0.8, 1, 1, 0.2
    u 2:  -0.9, 1, -1.2
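One way to read the item table on this slide is as per-item weights over user features, which can be combined with a user's feature vector at request time. The slides do not state the scoring function, so the logistic score below is an assumption. The BASE, M and F weights are borrowed from the first three columns of the item table, while `score` and the example user vector are hypothetical.

```python
import math


def score(item_weights, user_features):
    """Logistic score of one item for one user: sigmoid of the dot product
    between the item's weights and the user's feature values."""
    z = sum(w * user_features.get(f, 0.0) for f, w in item_weights.items())
    return 1.0 / (1.0 + math.exp(-z))


# ITEM x USER FEATURES weights (BASE, M, F columns of the slide's item table).
item_model = {
    "id1": {"BASE": 0.8, "M": 1.2, "F": -1.5},
    "id2": {"BASE": -0.9, "M": -0.9, "F": 2.6},
}

# Hypothetical user: the base term is always on, gender feature F is set.
user = {"BASE": 1.0, "F": 1.0}

# Rank items for this user, best first.
ranked = sorted(item_model, key=lambda i: score(item_model[i], user), reverse=True)
```

With these numbers, id2 ranks ahead of id1 for this user because its F weight outweighs its negative base weight.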
Models
  • USER MODEL (USER x CONTENT FEATURES): tracks user interest in terms of content features
  • ITEM MODEL (ITEM x USER FEATURES): tracks behavior of an item across user features
  • PRIORS (USER FEATURES x CONTENT FEATURES): tracks interactions of user features with content features
  • CLUSTERING (USER x USER): looks at user-user affinity based on the feature vectors
  • CLUSTERING (ITEM x ITEM): looks at item-item affinity based on item feature vectors
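Item-item affinity over feature vectors can be illustrated with cosine similarity. The slides do not name the similarity measure used for clustering, so cosine is an assumption here, and the items, feature names, and the `cosine` and `most_similar` helpers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(items, target):
    """Return the id of the item whose feature vector is closest to target's."""
    others = (i for i in items if i != target)
    return max(others, key=lambda i: cosine(items[target], items[i]))


# Hypothetical item feature vectors (content features only).
items = {
    "id1": {"CAT_Sports": 1.0, "geo_US": 1.0},
    "id2": {"CAT_Sports": 1.0, "geo_UK": 1.0},
    "id3": {"CAT_Finance": 1.0},
}

neighbor = most_similar(items, "id1")
```

The same function applies unchanged to user-user affinity if the dicts hold user feature vectors instead.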
Scale
  • Million events per second
  • Hundreds of GB per run
  • Millions of stories in pool
  • Tens of thousands of features (content and/or user)
Technology Stack Analytics and Debugging
Modeling Framework
  • Global state provided by HBase
  • A collection of PIG UDFs
  • Flow for modeling or stages assembled in PIG
    • OLR
    • Clustering
    • Affinity
    • Regression Models
    • Decompositions (Cholesky …)
  • Configuration-based behavioral changes for stages of modeling
    • Type of features to be generated
    • Type of joins to perform – User / Item / Feature
  • Input: DFS and/or HBase
  • Output: DFS and/or HBase
  • Standard pattern for updating serving stores
    • <Source, Transformation, Sink>
    • E.g. <HBase Table, Function(Features), PNUTS Table>
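The <Source, Transformation, Sink> pattern can be sketched as a small generic function. This is an illustration only: plain Python iterables and dicts stand in for the HBase table and the PNUTS table, no real HBase or PNUTS client API is used, and `run_update`, `to_serving_record`, and the field names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

Row = Tuple[str, dict]  # (row key, feature dict)


def run_update(source: Iterable[Row],
               transform: Callable[[dict], dict],
               sink: Dict[str, dict]) -> int:
    """Read rows from the source, apply the transformation, write to the sink.
    Returns the number of rows written."""
    written = 0
    for key, features in source:
        sink[key] = transform(features)
        written += 1
    return written


def to_serving_record(features: dict) -> dict:
    """Hypothetical Function(Features): keep only what serving needs."""
    views = features.get("views", 0)
    ctr = features["clicks"] / views if views else 0.0
    return {"ctr": ctr}


# Stand-ins: a list for the HBase table scan, a dict for the PNUTS table.
hbase_rows = [("id1", {"views": 200, "clicks": 10}),
              ("id2", {"views": 0, "clicks": 0})]
pnuts_table: Dict[str, dict] = {}

rows_written = run_update(hbase_rows, to_serving_record, pnuts_table)
```

Because the three parts are passed in separately, a different store or transformation can be swapped in without touching the update loop.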
HBase
  • ITEM Table
    • Stores item-related features
    • Stores ITEM x USER FEATURES model
    • Stores parameters about the item, like view count, click count, unique user count
    • 10s of millions of items
    • Updated every 5 minutes
  • USER Model
    • Stores USER x CONTENT FEATURES model for each individual user by either a Unique ID
    • Stores summarized user history – essential for modeling in terms of item decay
    • Millions of profiles
    • Updated every 5 to 30 minutes
  • TERM Model
    • Inverts the Item Table and stores statistics for the terms
    • Used to find the trending features and provide baselines for user features
    • Millions of terms and hundreds of parameters tracked
    • Updated every 5 minutes
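Inverting the Item Table into per-term statistics, as the TERM Model does, can be sketched as follows. The row layout (`terms`, `views`, `clicks`) and the choice of statistics are assumptions made for illustration, and a plain dict stands in for the HBase table.

```python
from collections import defaultdict


def invert_item_table(item_table):
    """Build term -> aggregate statistics from item -> (terms, counters)."""
    term_stats = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for item_id, row in item_table.items():
        for term in row["terms"]:
            stats = term_stats[term]
            stats["items"] += 1              # how many items carry this term
            stats["views"] += row["views"]   # summed over those items
            stats["clicks"] += row["clicks"]
    return dict(term_stats)


# Hypothetical item rows keyed by item id.
item_table = {
    "id1": {"terms": ["nba", "finals"], "views": 100, "clicks": 7},
    "id2": {"terms": ["nba"], "views": 50, "clicks": 1},
}

term_table = invert_item_table(item_table)
```

Comparing a term's aggregate clicks and views across successive runs is one simple way such a table could surface trending features.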
Grid Edge Services
  • Keep MR jobs lean and mean
  • Allow non-gridifiable solutions to be deployed and controlled easily
  • Have different scaling characteristics (e.g. memory, CPU)
  • Provide a gateway for accessing external data sources in M/R
  • Map and/or Reduce steps interact with Edge Services using a standard client
  • Examples
    • Categorization
    • Geo Tagging
    • Feature Transformation
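A map step that delegates categorization to an edge service might look like the sketch below. The client interface is not described in the slides, so `StubCategorizer` and its `categorize` method are hypothetical stand-ins for a real client that would make a network call. The per-title cache is likewise an assumption, added to show how the map step can stay lean.

```python
class StubCategorizer:
    """Hypothetical stand-in for an edge-service client."""

    def __init__(self):
        self.calls = 0  # counts service round trips

    def categorize(self, text):
        self.calls += 1
        return "Sports" if "game" in text.lower() else "News"


def map_records(records, client):
    """Map step: emit (item_id, category) pairs, calling the edge service
    at most once per distinct title."""
    cache = {}
    for item_id, title in records:
        if title not in cache:
            cache[title] = client.categorize(title)
        yield item_id, cache[title]


client = StubCategorizer()
records = [("id1", "Big game tonight"),
           ("id2", "Election results"),
           ("id3", "Big game tonight")]
output = list(map_records(records, client))
```

Keeping the categorizer behind a client means its memory and CPU footprint scales independently of the map tasks that call it.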
Analytics and Debugging
  • Provides the ability to debug modeling issues in near-real time
  • Run complex queries for analysis
  • Easy-to-use interface
  • PM, Engineers, Research use this cluster to get near-real-time insights
  • 10s of modeling monitoring and reporting queries every 5 minutes
  • We use HIVE
Data Flow
Learnings
  • PIG & HBase has been the best combination so far
    • Made it simple to build different kinds of science models
    • Point lookup using HBase has proven to be very useful
    • Modeling = Matrices; HBase provides a natural way to represent and access them
  • Edge Services
    • Have provided simplicity to the whole stack
    • Management (upgrades, outages) has been easy
  • HIVE has provided us a great way of analyzing the results
    • PIG was also considered
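The "Modeling = Matrices" point can be made concrete with a sparse matrix laid out the way a row-and-column store would hold it: one row per user or item, one column per feature. A nested dict stands in for the table, and no HBase API is used. The `put`, `get`, and `row_dot` helpers, the row keys, and the feature names are hypothetical.

```python
def put(table, row_key, column, value):
    """Write one matrix cell: table[row_key][column] = value."""
    table.setdefault(row_key, {})[column] = value


def get(table, row_key, column, default=0.0):
    """Point lookup of one cell; absent cells read as zero (sparse matrix)."""
    return table.get(row_key, {}).get(column, default)


def row_dot(table, row_key, vector):
    """Dot product of one stored row with a sparse vector."""
    row = table.get(row_key, {})
    return sum(v * vector.get(c, 0.0) for c, v in row.items())


# USER x CONTENT FEATURES matrix: row key = user id, column = content feature.
user_model = {}
put(user_model, "u1", "CAT_Sports", 0.2)
put(user_model, "u1", "CAT_Finance", -1.2)

weight = get(user_model, "u1", "CAT_Sports")
missing = get(user_model, "u2", "CAT_Sports")
affinity = row_dot(user_model, "u1", {"CAT_Sports": 1.0})
```

Only non-zero cells are stored, so a matrix with tens of thousands of feature columns stays compact, and fetching one user's row is a single point lookup.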
Questions?

1 content optimization-hug-2010-07-21


Editor's Notes

  • #9 NOTES: What does “X stories to run” mean? Can we be more clear on that? Also, this should be more of a punch line of what we do. This slide to me is very broad and not clear. Following are the things that I would describe: the problem of matching the best content to the interest of a user; scale (millions of content slices, millions of users).