1 content optimization-hug-2010-07-21
Published in: Technology
  • Instant My! add from D White
  • This is the agenda slide. There is only one of these in the deck. NOTES: What does "X stories to run" mean? Can we be clearer about that? Also, this should be more of a punch line for what we do. This slide, to me, is very broad and not clear. These are the things I would describe: the problem of matching the best content to the interests of a user, and the scale (millions of content slices, millions of users).
  • Transcript

    • 1. Online Content Optimization using Hadoop
      • Nitin Motgi
      • [email_address]
    • 2. What is Yahoo? "… Listen, Yahoo is a great company that is very, very strong in content for its users, uses amazing … For instance, on our Today module on the front page, every 5 minutes we have 32,000 different variations of that module. So you don’t even know what I’m seeing. In fact, we serve a million different front page modules a day, and that’s just through content optimization. And that’s just the beginning … Customized because we know the things you’re interested in …"
    • 3. What do we do?
      • "Deliver the right CONTENT to the right USER at the right TIME"
      • Keep Carol Bartz excited
    • 4. Relevance at Yahoo! (diagram labels: Important, Editors, Popular, Personal / Social, People, 10s of Items, Science, Millions of Items)
    • 5. Ranking Problems
      • Most Popular: Most engaging overall, based on objective metrics
      • Related Items: Behavioral affinity (people who did X, did Y)
      • Deep Personalization: Most relevant to me, based on my deep interests
      • Real-time Dashboard: Business optimization
      • Light Personalization: More relevant to me, based on my age, gender and property usage
      • Most Popular + Per User History: Engaging overall, and aware of what I’ve already seen
    • 6. Flow
      • Optimization Engine
      • Content feed with biz rules
      • Explore ~1%, Exploit ~99%
      • Real-time Feedback
      • Content Metadata
      • Dashboard
      • Optimized Module
      • Real-time Insights
      • Rules Engine
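The explore/exploit split on this slide can be illustrated with an epsilon-greedy choice. This is only a sketch under assumptions: the slide gives the ~1% / ~99% split and nothing else, so the function name `pick_item`, the dictionary of click-through estimates, and the uniform random exploration are hypothetical and not the deck's actual engine.

```python
import random

EXPLORE_RATE = 0.01  # ~1% explore, ~99% exploit (split taken from the slide)

def pick_item(estimated_ctr, rng=random, explore_rate=EXPLORE_RATE):
    """Return an item id from a dict of hypothetical CTR estimates."""
    if not estimated_ctr:
        raise ValueError("no candidate items")
    if rng.random() < explore_rate:
        # Explore: serve a random candidate to gather feedback on it.
        return rng.choice(sorted(estimated_ctr))
    # Exploit: serve the candidate with the best current estimate.
    return max(estimated_ctr, key=estimated_ctr.get)
```

Calling `pick_item({"a": 0.1, "b": 0.3})` returns `"b"` on most calls and a random candidate on roughly one call in a hundred.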
    • 7. How it happens?
      • User Events: At time ‘t’, User ‘u’ (user attr: age, gen, loc) interacted with Content ‘id’ at Position ‘o’, Property/Site ‘p’, Section ‘s’, Module ‘m’, International ‘i’
      • Item Metadata: Content ‘id’ has associated metadata ‘meta’, where meta = {entity, keyword, geo, topic, category}
      • Feature Generation: Additional Content & User Feature Generation
      • Modeling: ITEM Model, USER Model
      • STORE: PNUTS
      • Ranking, B-Rules, Request
      • Latency labels: 5 min latency; 5 – 30 min latency; SLA 50 ms – 200 ms
      • Table (columns: Item, BASE, M, F, ATTR, CAT_Sports)
        • id 1: 0.8, +1.2, -1.5, -0.9, 1.0
        • id 2: -0.9, -0.9, +2.6, +0.3, 1.0
      • Table (columns: Item, BASE, M, F, ATTR, CAT_Sports)
        • u 1: 0.8, 1, 1, 0.2
        • u 2: -0.9, 1, -1.2
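The event tuple and item metadata on this slide can be written down as a small record type plus a join. This is a rough sketch: the field letters follow the slide, but the class name `UserEvent`, the helper names, and the `field=value` feature encoding are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserEvent:
    t: int           # time of the interaction
    u: str           # user id
    age: int         # user attributes: age, gender, location
    gen: str
    loc: str
    content_id: str  # content 'id'
    o: int           # position
    p: str           # property/site
    s: str           # section
    m: str           # module
    i: str           # international

META_FIELDS = ("entity", "keyword", "geo", "topic", "category")

def content_features(meta):
    """Flatten a metadata dict into 'field=value' feature strings."""
    features = []
    for key in META_FIELDS:
        for value in meta.get(key, []):
            features.append(f"{key}={value}")
    return features

def join_event(event, item_meta):
    """Pair the event's user with the features of the content they touched."""
    meta = item_meta.get(event.content_id, {})
    return [(event.u, feat) for feat in content_features(meta)]
```

The (user, feature) pairs produced by `join_event` are the kind of input a user model over content features would aggregate.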
    • 8. Models
      • USER x CONTENT FEATURES
        • USER MODEL: Tracks user interest in terms of content features
      • ITEM x USER FEATURES
        • ITEM MODEL: Tracks behavior of an item across user features
      • USER FEATURES x CONTENT FEATURES
        • PRIORS: Tracks interactions of user features with content features
      • USER x USER
        • CLUSTERING: Looks at user-user affinity based on the feature vectors
      • ITEM x ITEM
        • CLUSTERING: Looks at item-item affinity based on item feature vectors
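Item-item affinity over feature vectors, as described in the last bullet, can be illustrated with cosine similarity on sparse vectors. The slides do not say which affinity measure is used, so cosine similarity is an assumption here, as are the function names and the dict-based vector layout.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector has no affinity with anything
    return dot / (norm_a * norm_b)

def most_similar(target, vectors):
    """Return the id (other than target) whose vector is closest to target's."""
    others = (k for k in vectors if k != target)
    return max(others, key=lambda k: cosine(vectors[target], vectors[k]), default=None)
```

The same two functions apply unchanged to user-user affinity if the dicts hold user feature vectors instead.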
    • 9. Scale
      • Million events per second
      • Hundreds of GB per run
      • Millions of stories in pool
      • Tens of Thousands of Features (Content and/or User)
    • 10. Technology Stack (diagram label: Analytics and Debugging)
    • 11. Modeling Framework
      • Global state provided by HBase
      • A collection of PIG UDFs
      • Flow for modeling or stages assembled in PIG
        • OLR
        • Clustering
        • Affinity
        • Regression Models
        • Decompositions (Cholesky …)
      • Configuration-based behavioral changes for stages of modeling
        • Types of features to generate
        • Type of joins to perform – User / Item / Feature
      • Input : DFS and/or HBase
      • Output: DFS and/or HBase
      • Standard pattern for updating serving stores
      • <Source, Transformation, Sink>
      • E.g. <HBase Table, Function(Features), PNUTS Table>
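The <Source, Transformation, Sink> pattern can be sketched as three plain callables. Everything below is a stand-in: in-memory dicts play the role of the HBase and PNUTS tables, and `run_update`, `scan_hbase`, `ctr_feature` and `put_pnuts` are hypothetical names, not APIs of either system.

```python
def run_update(source, transformation, sink):
    """Stream rows from source, transform each one, and write it to sink."""
    written = 0
    for key, features in source():
        sink(key, transformation(features))
        written += 1
    return written

# Hypothetical stand-ins for an HBase table scan and a PNUTS table write.
hbase_rows = {"id1": {"views": 200, "clicks": 10},
              "id2": {"views": 50, "clicks": 5}}
pnuts_table = {}

def scan_hbase():
    return iter(sorted(hbase_rows.items()))

def ctr_feature(features):
    # Example transformation: derive a click-through rate from raw counts.
    views = features["views"]
    return {"ctr": features["clicks"] / views if views else 0.0}

def put_pnuts(key, value):
    pnuts_table[key] = value

count = run_update(scan_hbase, ctr_feature, put_pnuts)
```

Keeping the three parts separate means a new serving-store update only needs a different transformation or sink, which matches the configuration-driven stages described above.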
    • 12. HBase
      • ITEM Table
        • Stores item related features
        • Stores ITEM x USER FEATURES model
        • Stores parameters about item like view count, click count, unique user count.
        • Tens of millions of items
        • Updated every 5 minutes
      • USER Model
        • Stores USER x CONTENT FEATURES model for each individual user by either a Unique ID
        • Stores summarized user history – Essential for Modeling in terms of item decay
        • Millions of profiles
        • Updated every 5 to 30 minutes
      • TERM Model
        • Inverts the Item Table and stores statistics for the terms.
        • Used to find the trending features and provide baselines for user features
        • Millions of terms and hundreds of parameters tracked
        • Updates every 5 minutes
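The TERM model's "inverts the Item Table" step can be pictured as turning item rows into per-term statistics. This is a minimal sketch with an assumed row layout (`terms`, `views`, `clicks`) and only three of the "hundreds of parameters" mentioned above; it uses in-memory dicts and no HBase calls.

```python
from collections import defaultdict

def invert_items(item_table):
    """Build term -> {items, views, clicks} from item -> {terms, views, clicks}."""
    term_table = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for item_id, row in item_table.items():
        # Count each term once per item, even if it is listed twice.
        for term in set(row["terms"]):
            stats = term_table[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(term_table)
```

Per-term totals like these are one way to spot trending features and to supply baselines for user features.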
    • 13. Grid Edge Services
      • Keeps MR jobs lean and mean
      • Provides the ability to easily deploy and control non-gridifiable solutions
      • Have different scaling characteristics (E.g. Memory, CPU)
      • Provide gateway for accessing external data sources in M/R
      • Map and/or Reduce steps interact with Edge Services using a standard client
      • Examples
        • Categorization
        • Geo Tagging
        • Feature Transformation
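A map step that hands categorization to an edge service might look roughly like the sketch below. The slides do not describe the client interface, so `FakeCategorizer` and its `categorize` method are hypothetical stand-ins for a network client, and the per-task cache is an added assumption.

```python
class FakeCategorizer:
    """Stand-in for an edge service client; a real one would make a network call."""

    def __init__(self):
        self.calls = 0

    def categorize(self, text):
        self.calls += 1
        return "CAT_Sports" if "game" in text.lower() else "CAT_Other"

def map_records(records, client):
    """Map step: emit (category, item_id), delegating categorization to the client."""
    cache = {}  # avoid repeated service calls for identical text
    for item_id, text in records:
        if text not in cache:
            cache[text] = client.categorize(text)
        yield cache[text], item_id
```

The mapper itself stays small because the categorization logic, and its memory or CPU footprint, lives behind the client.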
    • 14. Analytics and Debugging
      • Provides the ability to debug modeling issues in near-real time
      • Run complex queries for analysis
      • Easy to use interface
      • PM, Engineers, Research use this cluster to get near-real time insights
      • Tens of model-monitoring and reporting queries every 5 minutes
      • We use HIVE
    • 15.
    • 16. Data Flow
    • 17. Learnings
      • PIG & HBase have been the best combination so far
        • Made it simple to build different kinds of science models
        • Point lookup using HBase has proven to be very useful
        • Modeling = Matrices
          • HBase provides a natural way to represent and access them
      • Edge Services
        • Have provided simplicity to the whole stack
        • Management (upgrades, outages) has been easy
      • HIVE has provided us a great way for analyzing the results
        • PIG was also considered
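The "Modeling = Matrices" point can be pictured as a sparse matrix keyed by row and then by column, which is the shape a row-key / column layout gives you. The sketch below uses nested dicts only; `put_cell` and `get_cell` are hypothetical helpers, not HBase client calls.

```python
def put_cell(table, row_key, column, value):
    """Store one matrix entry under (row_key, column)."""
    table.setdefault(row_key, {})[column] = value

def get_cell(table, row_key, column, default=0.0):
    """Point lookup of one matrix entry; missing cells read as the default."""
    return table.get(row_key, {}).get(column, default)

# Two cells from an item-by-user-feature matrix, stored sparsely.
item_model = {}
put_cell(item_model, "id 1", "BASE", 0.8)
put_cell(item_model, "id 1", "M", 1.2)
```

Only the cells that exist take up space, and a single entry can be read without scanning the rest of the row.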
    • 18. Questions?
