Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoop" by Shail Aditya
 

Presentation Transcript

  • Hadoop Summit 2011
    Online Content Optimization using Hadoop
    Shail Aditya
    shailg@yahoo-inc.com
  • What do we do?
    • Deliver the right CONTENT to the right USER at the right TIME
    • Effectively and proactively learn from user interactions with the displayed content, to maximize our objectives
    • A new scientific discipline at the interface of:
    • Large-scale machine learning and statistics
    • Multi-objective optimization in the presence of uncertainty
    • User understanding
    • Content understanding
  • Content Relevance at Yahoo!
    Editorial: editors pick the important stories (10s of items)
    Science: algorithms surface popular and personal/social content (millions of items)
  • Content Ranking Problems
    Most Popular: most engaging overall, based on objective metrics
    Most Popular + Per-User History: rotate stories I have already seen
    Light Personalization: more relevant to me, based on my age, gender, location, and property usage
    Deep Personalization: most relevant to me, based on my deep interests (entities, sources, categories, keywords)
    Related Items and Context-Sensitive Models: behavioral affinity (people who did X also did Y); most engaging in this page/section/property/device/referral context
    Layout Optimization: which modules/ad units should be shown to this user in this context?
    Revenue Optimization
    Voice and Business Rules
    Real-time Dashboard
  • Yahoo! Frontpage
    Trending Now (most popular)
    Today Module (light personalization)
    Personal Assistant (light personalization)
    National News (most popular + user-history bucket)
    Deals (most popular)
  • Recommendation: A Match-Making Problem
    • Recommendation problems: search (web, vertical); online advertising
    • Opportunity: users, queries, pages, …
    • Item inventory: articles, web pages, ads, …
    • Use an automated algorithm to select item(s) to show
    • Get feedback (clicks, time spent, …)
    • Refine the models
    • Repeat (a large number of times)
    • Measure metric(s) of interest (total clicks, total revenue, …)
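The serve → feedback → refine loop above can be sketched as a simple click-through-rate ranker with a small explore fraction. This is a toy stand-in for the production system, not the actual algorithm; the class and field names are illustrative.

```python
import random

class MostPopularRanker:
    """Toy 'most popular' ranker: serve, collect feedback, refine, repeat."""

    def __init__(self, items, explore_rate=0.01):
        self.stats = {item: {"views": 0, "clicks": 0} for item in items}
        self.explore_rate = explore_rate

    def select(self):
        # Explore: occasionally serve a random item to gather fresh feedback.
        if random.random() < self.explore_rate:
            return random.choice(list(self.stats))
        # Exploit: serve the item with the best observed click-through rate.
        return max(self.stats, key=self._ctr)

    def _ctr(self, item):
        s = self.stats[item]
        # Smoothed CTR so unseen items are not stuck at zero.
        return (s["clicks"] + 1) / (s["views"] + 2)

    def feedback(self, item, clicked):
        # Refine the model with the observed interaction.
        self.stats[item]["views"] += 1
        self.stats[item]["clicks"] += int(clicked)
```

In production the "refine" step is a Hadoop modeling run rather than an in-memory counter update, but the loop structure is the same.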
  • Problem Characteristics: Today Module
    Traffic obtained from a controlled randomized experiment
    Things to note: (a) short item lifetimes, (b) temporal effects, (c) often driven by breaking news stories
  • Scale: Why Use Hadoop?
    A million events per second (user views/clicks, content updates)
    Hundreds of GB of data collected and modeled per run
    Millions of items in the pool
    Millions of user profiles
    Tens of thousands of features (content and/or user)
  • Data Flow
    Content feed with business rules → Rules Engine → content metadata → Optimization Engine
    Optimization Engine → optimized module, splitting traffic into ~99% exploit and ~1% explore
    Near-real-time feedback flows back into the Optimization Engine
    Real-time insights are surfaced on a dashboard
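The ~99%/~1% exploit/explore split has to be consistent per user so that the explore bucket stays a stable randomized experiment. One common way to do that is deterministic hashing of the user ID; a minimal sketch (the function and its parameters are illustrative, not the production mechanism):

```python
import hashlib

def serving_bucket(user_id, explore_pct=1.0):
    """Deterministically route ~explore_pct% of users to the explore bucket.

    Hashing the user ID (rather than flipping a coin per request) keeps
    each user in the same bucket across requests.
    """
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10000
    # explore_pct is in percent: 1.0 -> h in [0, 100) out of 10000 -> ~1%.
    return "explore" if h < explore_pct * 100 else "exploit"
```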
  • How It Happens
    • Serving: each request is ranked against the models, with business rules applied, under a 50 ms – 200 ms SLA
    • User events: at time 't', user 'u' (attributes: age, gender, location) interacted with content 'id' at position 'o', on property/site 'p', section 's', module 'm', international flag 'i'
    • Item metadata: content 'id' has associated metadata 'meta' = {entity, keyword, geo, topic, category}
    • Feature generation joins events with additional content and user features
    • Modeling produces the ITEM model (store: HBase, ~5 min latency) and the USER model (store: PNUTS, 5–30 min latency)
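The event record and the metadata join described above might look like the following sketch. The field names mirror the slide's notation ('t', 'o', 'p', 's', 'm') but are illustrative, not Yahoo!'s actual schema.

```python
# A logged interaction event, per the slide's notation.
event = {
    "t": 1300000000,                                      # time 't'
    "user": {"id": "u123", "age": 34, "gender": "f", "loc": "IN"},
    "content_id": "id42",
    "position": 2,                                        # position 'o'
    "property": "frontpage",                              # property/site 'p'
    "section": "news",                                    # section 's'
    "module": "today",                                    # module 'm'
    "international": False,                               # flag 'i'
}

# Item metadata table: meta = {entity, keyword, geo, topic, category}.
item_meta = {
    "id42": {"entity": ["cricket"], "keyword": ["world cup"],
             "geo": ["IN"], "topic": ["sports"], "category": ["news"]},
}

def join_event_with_meta(event, item_meta):
    """Feature-generation step: attach content metadata to the event."""
    enriched = dict(event)
    enriched["meta"] = item_meta.get(event["content_id"], {})
    return enriched
```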
  • Models
    • USER x CONTENT FEATURES (USER MODEL): tracks each user's interest in terms of content features
    • ITEM x USER FEATURES (ITEM MODEL): tracks an item's behavior across user features
    • USER FEATURES x CONTENT FEATURES (PRIORS): tracks interactions of user features with content features
    • USER x USER (CLUSTERING): user-user affinity based on user feature vectors
    • ITEM x ITEM (CLUSTERING): item-item affinity based on item feature vectors
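Since each model is essentially a matrix, scoring an item for a user reduces to a dot product between the user's interest weights over content features and the item's content-feature vector. A minimal sketch with sparse feature dicts (the feature names and weights are illustrative):

```python
def score(user_model, item_features):
    """Dot product of a USER x CONTENT-FEATURES row with an item's features."""
    return sum(user_model.get(f, 0.0) * w for f, w in item_features.items())

# Illustrative user interest weights over content features.
user_model = {"topic:sports": 0.8, "geo:IN": 0.5, "topic:finance": -0.2}

items = {
    "id42": {"topic:sports": 1.0, "geo:IN": 1.0},
    "id99": {"topic:finance": 1.0},
}

# Rank items for this user by score, highest first.
ranking = sorted(items, key=lambda i: score(user_model, items[i]), reverse=True)
```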
  • Technology Stack
    Ingest
    Analytics and
    Debugging
  • Modeling Framework
    • Global state provided by HBase
    • Hadoop processing via a collection of Pig UDFs
    • Different modeling flows and stages are assembled in Pig
    • OLR, clustering, affinity, regression models, decompositions (Cholesky, …)
    • Time-series models (generally trends: extracts of user activity on content)
    • Configuration-driven behavior for the various modeling stages
    • Which features to generate
    • Which joins to perform: user / item / feature
    • Input: DFS and/or HBase
    • Output: DFS and/or HBase
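The slide lists OLR (online logistic regression) among the models. One way such an update could look on sparse feature dicts, as a hedged sketch (the learning rate and feature encoding are illustrative, not the production configuration):

```python
import math

def olr_update(weights, features, clicked, lr=0.1):
    """One online logistic-regression step on a sparse feature dict.

    Returns the predicted click probability before the update.
    """
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    p = 1.0 / (1.0 + math.exp(-z))           # predicted click probability
    err = (1.0 if clicked else 0.0) - p      # gradient of the log-loss
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) + lr * err * v
    return p
```

In the framework described above, a step like this would run inside a Pig UDF over event batches, with the weights read from and written back to HBase.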
  • HBase
    • ITEM Model
    • Stores item-related features
    • Stores the ITEM x USER FEATURES model
    • Stores per-item parameters such as view count, click count, and unique-user count
    • Tens of millions of items
    • Updated every 5 minutes
    • USER Model
    • Stores the USER x CONTENT FEATURES model for each individual user, keyed by a unique ID
    • Stores summarized user history, essential for modeling item decay
    • Millions of profiles
    • Updated every 5 to 30 minutes
    • TERM Model
    • Inverts the Item table and stores statistics for the terms
    • Used to find trending features and provide baselines for user features
    • Millions of terms, hundreds of parameters tracked
    • Updated every 5 minutes
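The "invert the Item table" step that builds the TERM model can be sketched as aggregating per-item counts into per-term statistics. A toy in-memory version (in practice this is a MapReduce/Pig job over HBase; the field names are illustrative):

```python
from collections import defaultdict

def build_term_model(item_table):
    """Invert item -> terms into term -> aggregated view/click statistics."""
    term_stats = defaultdict(lambda: {"views": 0, "clicks": 0})
    for item in item_table.values():
        for term in item["terms"]:
            term_stats[term]["views"] += item["views"]
            term_stats[term]["clicks"] += item["clicks"]
    return dict(term_stats)
```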
  • Grid Edge Services
    • Keep MR jobs lean and mean
    • Allow components that cannot easily be gridified to be deployed easily
    • Have different scaling characteristics (e.g., memory, CPU)
    • Provide a gateway for accessing external data sources from M/R
    • Map and/or reduce steps interact with edge services using a standard client
    • Examples: categorization, geo-tagging, feature transformation
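The pattern of a map step calling out to an edge service through a standard client might look like the following sketch. `CategorizerClient` is a hypothetical stand-in for an RPC client to the categorization service, not a real API.

```python
class CategorizerClient:
    """Hypothetical stand-in for the standard client to a categorization
    edge service; a real client would make an RPC here."""

    def categorize(self, text):
        return ["sports"] if "cricket" in text else ["general"]

def map_step(records, client):
    """Map-side enrichment: attach categories from the edge service."""
    for rec in records:
        yield dict(rec, categories=client.categorize(rec["text"]))
```

Keeping the categorizer behind a service boundary is what lets it scale on its own (memory- or CPU-bound) profile instead of inflating every map task.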
  • Analytics and Debugging
    • Ability to debug modeling issues in near real time
    • Run complex queries for analysis
    • Easy-to-use interface
    • PMs, engineers, and researchers use this cluster for near-real-time insights
    • 10s of model-monitoring and reporting queries every 5 minutes
    • We use Hive
  • Learnings
    • Pig + HBase has been the best combination so far
    • Made it simple to build different kinds of science models
    • Point lookups using HBase have proven very useful
    • Modeling = matrices
    • HBase provides a natural way to represent and access them
    • Edge services
    • Have kept the whole stack simple
    • Management (upgrades, outages) has been easy
    • Hive has been a great way to analyze results
    • Pig was also considered
  • Thank You
    Deliver the right CONTENT to the right USER at the right TIME
    Shail Aditya
    shailg@yahoo-inc.com