Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoop" by Shail Aditya

Slide notes:
  • This is the title slide. Please use the name of the presentation that was used in the abstract submission.
  • This is the agenda slide; there is only one of these in the deck. Notes: What does "X stories to run" mean? Can we be clearer on that? Also, this should be more of a punch line of what we do; the slide is very broad and unclear. I would describe: the problem of matching the best content to the interests of a user, and the scale involved (millions of content slices, millions of users).
  • This is the final slide, generally for questions at the end of the talk. Please post your contact information here.

    1. Hadoop Summit 2011: Online Content Optimization using Hadoop
       Shail Aditya, shailg@yahoo-inc.com
    2. What do we do?
       • Deliver the right CONTENT to the right USER at the right TIME
       • Effectively and pro-actively learn from user interactions with the displayed content, to maximize our objectives
       • A new scientific discipline at the interface of:
         • Large-scale machine learning and statistics
         • Multi-objective optimization in the presence of uncertainty
         • User understanding
         • Content understanding
    3. Content Relevance at Yahoo!
       • Editorial: editors hand-pick 10s of important items
       • Science: millions of items, ranked by popular and personal/social signals
    4. Content Ranking Problems
       • Most Popular: most engaging overall, based on objective metrics
       • Most Popular + Per-User History: rotate stories I have already seen
       • Light Personalization: more relevant to me based on my age, gender, location, and property usage
       • Deep Personalization: most relevant to me based on my deep interests (entities, sources, categories, keywords)
       • Related Items and Context-Sensitive Models: behavioral affinity (people who did X, did Y); most engaging in this page/section/property/device/referral context
       • Layout Optimization: which modules/ad units should be shown to this user in this context?
       • Revenue Optimization
       • Voice and Business Rules
       • Real-time Dashboard
    5. Yahoo! Frontpage
       • Trending Now (most popular)
       • Today Module (light personalization)
       • Personal Assistant (light personalization)
       • National News (most popular + user-history bucket)
       • Deals (most popular)
    6. Recommendation: A Match-making Problem
       • Related recommendation problems: search (web, vertical), online advertising, …
       • An opportunity (users, queries, pages, …) is matched against an item inventory (articles, web pages, ads, …)
       • Use an automated algorithm to select item(s) to show
       • Get feedback (clicks, time spent, …) and refine the models
       • Repeat (a large number of times)
       • Measure metric(s) of interest (total clicks, total revenue, …)
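The select/feedback/refine loop above can be sketched end to end. This is a minimal simulation with hypothetical items and click rates; the "model" here is just a smoothed per-item CTR estimate, not the production system:

```python
import random

random.seed(7)

# Hypothetical item pool with unknown true click-through rates.
true_ctr = {"story_a": 0.05, "story_b": 0.12, "story_c": 0.08}
# Optimistic initialization (1 click / 1 view) so every item gets tried.
views = {item: 1 for item in true_ctr}
clicks = {item: 1 for item in true_ctr}
total_clicks = 0

for _ in range(10_000):
    # Select: the automated algorithm picks the best item under the model.
    chosen = max(true_ctr, key=lambda i: clicks[i] / views[i])
    # Feedback: observe whether the (simulated) user clicked.
    clicked = random.random() < true_ctr[chosen]
    # Refine: fold the observation back into the model.
    views[chosen] += 1
    clicks[chosen] += clicked
    total_clicks += clicked

# Measure the metric of interest.
print("total clicks:", total_clicks)
```

Run long enough, the loop concentrates traffic on the item with the best observed engagement.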
    7. Problem Characteristics: Today Module
       • Traffic obtained from a controlled randomized experiment
       • Things to note: (a) short item lifetimes, (b) temporal effects, (c) often a breaking news story
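Short lifetimes and temporal effects mean stale observations should count for less. One common way to handle this (an assumption here; the slide does not say which scheme is used) is to decay view/click counts exponentially with age:

```python
HALF_LIFE_MIN = 60.0  # assumed half-life: an hour-old click counts half as much

def decay(count, age_minutes, half_life=HALF_LIFE_MIN):
    """Down-weight a count observed age_minutes ago."""
    return count * 0.5 ** (age_minutes / half_life)

# Clicks on one story, collected in batches: (age in minutes, clicks).
batches = [(120, 40), (60, 80), (5, 30)]
effective_clicks = sum(decay(clicks, age) for age, clicks in batches)
print(f"effective clicks: {effective_clicks:.1f}")  # raw total is 150
```

The freshest batch dominates, so a fading story drops out of the ranking quickly even if its lifetime totals are large.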
    8. Scale: Why Use Hadoop?
       • A million events per second (user views/clicks, content updates)
       • Hundreds of GB of data collected and modeled per run
       • Millions of items in the pool
       • Millions of user profiles
       • Tens of thousands of features (content and/or user)
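A back-of-envelope check connects these numbers (the event size is an assumption; the slide gives only the rates): at a million events per second, a single modeling window easily reaches hundreds of GB.

```python
EVENTS_PER_SEC = 1_000_000   # from the slide
EVENT_BYTES = 150            # assumed average serialized event size
RUN_SECONDS = 30 * 60        # one modeling window (runs are 5-30 min per the deck)

gb_per_run = EVENTS_PER_SEC * EVENT_BYTES * RUN_SECONDS / 1e9
print(f"~{gb_per_run:.0f} GB collected per 30-minute run")
```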
    9. Data Flow
       • Content feed with business rules → Rules Engine → content metadata → Optimization Engine
       • Optimization Engine serves the optimized module: ~99% exploit traffic, ~1% explore traffic
       • Near-real-time feedback flows back into the engine; real-time insights feed a dashboard
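The ~99% exploit / ~1% explore split above is essentially an epsilon-greedy serving policy. A sketch with hypothetical items and model scores (the production policy may well be more sophisticated):

```python
import random

random.seed(42)
EPSILON = 0.01  # ~1% of traffic explores; ~99% exploits

est_ctr = {"item_1": 0.10, "item_2": 0.04, "item_3": 0.07}  # current model scores

def serve():
    if random.random() < EPSILON:
        # Explore: a random item, giving the models unbiased feedback data.
        return random.choice(list(est_ctr))
    # Exploit: the item the model currently ranks highest.
    return max(est_ctr, key=est_ctr.get)

served = [serve() for _ in range(100_000)]
best_share = served.count("item_1") / len(served)
print(f"best item served {best_share:.1%} of the time")
```

The small explore slice is what keeps the near-real-time feedback from only ever confirming the current ranking.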
    10. How It Happens
        • Serving path: request → ranking + business rules, with an SLA of 50-200 ms
        • User events: at time t, user u (attributes: age, gender, location) interacted with content id at position o, on property/site p, section s, module m, international flag i
        • Content id has associated metadata: meta = {entity, keyword, geo, topic, category}
        • Feature generation produces additional content and user features for modeling (5-30 min latency)
        • ITEM model stored in HBase, ~5 min latency
        • USER model stored in PNUTS
        • Item metadata store
    11. Models
        • USER x CONTENT FEATURES (USER model): tracks user interest in terms of content features
        • ITEM x USER FEATURES (ITEM model): tracks the behavior of an item across user features
        • USER FEATURES x CONTENT FEATURES (priors): tracks interactions of user features with content features
        • USER x USER (clustering): user-user affinity based on the feature vectors
        • ITEM x ITEM (clustering): item-item affinity based on item feature vectors
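The USER x CONTENT FEATURES model suggests a simple scoring rule: match the user's learned interest weights against an item's content-feature vector. A minimal sketch, with hypothetical features and weights:

```python
# USER model: learned interest weights over content features (hypothetical).
user_model = {"sports": 0.9, "politics": 0.1, "finance": 0.4}

# Content features per item (hypothetical).
items = {
    "story_nba":    {"sports": 1.0},
    "story_budget": {"politics": 0.6, "finance": 0.8},
}

def score(user, item_features):
    """Dot product over the shared content features."""
    return sum(w * item_features.get(f, 0.0) for f, w in user.items())

ranked = sorted(items, key=lambda i: score(user_model, items[i]), reverse=True)
print(ranked)  # story_nba scores 0.9, story_budget scores 0.38
```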
    12. Technology Stack (diagram)
        • Ingest
        • Analytics and debugging
    13. Modeling Framework
        • Global state provided by HBase
        • Hadoop processing via a collection of Pig UDFs
        • Different flows for modeling stages assembled in Pig
        • OLR, clustering, affinity, regression models, decompositions (Cholesky, …)
        • Time-series models (generally trends extracted from user activity on content)
        • Configuration-based behavior for the various stages of modeling:
          • type of features to be generated
          • type of joins to perform (user / item / feature)
          • input: DFS and/or HBase
          • output: DFS and/or HBase
    14. HBase
        • ITEM model
          • Stores item-related features and the ITEM x USER FEATURES model
          • Stores per-item parameters such as view count, click count, and unique user count
          • Tens of millions of items, updated every 5 minutes
        • USER model
          • Stores the USER x CONTENT FEATURES model for each individual user, keyed by a unique ID
          • Stores summarized user history, essential for modeling item decay
          • Millions of profiles, updated every 5 to 30 minutes
        • TERM model
          • Inverts the item table and stores statistics for the terms
          • Used to find trending features and to provide baselines for user features
          • Millions of terms and hundreds of parameters tracked, updated every 5 minutes
    15. Grid Edge Services
        • Keep MR jobs lean and mean
        • Allow non-gridifyable solutions to be deployed easily
        • Can have different scaling characteristics (e.g. memory, CPU)
        • Provide a gateway for accessing external data sources from M/R
        • Map and/or reduce steps interact with edge services using a standard client
        • Examples: categorization, geo-tagging, feature transformation
    16. Analytics and Debugging
        • Provides the ability to debug modeling issues in near real time
        • Complex queries for analysis through an easy-to-use interface
        • PMs, engineers, and researchers use this cluster to get near-real-time insights
        • Tens of model-monitoring and reporting queries every 5 minutes
        • We use Hive
    17. Learnings
        • Pig and HBase have been the best combination so far: simple to build different kinds of science models, and point lookup via HBase has proven very useful
        • Modeling = matrices, and HBase provides a natural way to represent and access them
        • Edge services have kept the whole stack simple; management (upgrades, outages) has been easy
        • Hive has given us a great way to analyze results (Pig was also considered)
    18. Thank You
        Deliver the right CONTENT to the right USER at the right TIME
        Shail Aditya, shailg@yahoo-inc.com
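The "modeling = matrices" point maps naturally onto HBase: each sparse matrix row can be an HBase row keyed by a user or item ID, with one column per non-zero entry, which is exactly the point-lookup pattern called out in the learnings. A dict-based stand-in (the real schema is not given in the deck):

```python
# Stand-in for an HBase table holding a sparse USER x CONTENT-FEATURES matrix:
# row key = user ID, column qualifier = feature, cell value = weight.
matrix_table = {
    "user_1": {"feat:sports": 0.9, "feat:finance": 0.4},
    "user_2": {"feat:politics": 0.7},
}

def get_row(user_id):
    """Point lookup of one matrix row, the access pattern HBase serves cheaply."""
    return matrix_table.get(user_id, {})

def set_cell(user_id, feature, weight):
    """Update a single matrix entry without rewriting the rest of the row."""
    matrix_table.setdefault(user_id, {})[f"feat:{feature}"] = weight

set_cell("user_2", "sports", 0.2)
print(get_row("user_2"))
```

Because only non-zero entries are stored, millions of such rows stay compact even with tens of thousands of possible features.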
