Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.

Statistics

Views
  • Total views: 5,466
  • Views on SlideShare: 4,937
  • Embed views: 529

Actions
  • Likes: 6
  • Downloads: 104
  • Comments: 1

9 Embeds (529 views)
  • http://hbase.info: 433
  • http://www.scoop.it: 75
  • http://www.linkedin.com: 6
  • url_unknown: 5
  • https://www.linkedin.com: 4
  • http://cache.baidu.com: 3
  • http://www.slideshare.net: 1
  • http://twitter.com: 1
  • http://www.slashdocs.com: 1

Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

  • CBSi has a number of brands; this slide shows the biggest ones. I’m in the TBN group and the work I’ll present is being done for CNET, with the intent to be extended horizontally.
  • We have a lot of traffic and data. We’ve been using Hadoop quite extensively for a few years now. 135/150TB currently, soon to be 500TB.
  • Summarize what I’ll discuss
  • We do a number of these items already, but in disparate systems.
  • Simplified overview of the approach. Details to be discussed on the next data flow slide.
  • Multiple data load options – bulk, real-time, incremental update. MapReduce to examine data. Export data to RDF in the triple store. Analysts and engineers can access HBase or MR to explore data. For now we’re using various triple stores for experimentation, we haven’t done a full evaluation yet. Technology for triple store or graph store still TBD.
  • The slope of this plot is subjective, but conceptually this is the case. HBase would be in the upper left quadrant and a graph store would be in the lower right. Our solution leverages the strength of each and we use MR to go from one to the other.
  • Just an example of a graph we can build. The graph can be adapted to meet use cases. The anonymous user has relationships to other identities, as well as assets that he/she interacts with. The graph is built from items from different data sources: blue=firehose, orange=CMS, green=tagging systems, red=DW
  • Simple schema. 1..* for both aliases and events.
  • The next few slides will walk through some specifics of the data flow. How do we get data into HBase? One of the nice things about HBase is that it supports a number of techniques to load data.
  • Once data is in HBase, we selectively build RDF relationships to store in the triple store. Pig allows for easy iteration.
  • One of our simpler scripts. It’s 6 Pig statements to generate this set of RDF. We have a UDF to abstract out the RDF string construction.
  • Recommend the most recent blog content that is tagged with the same tags as the user’s FB like.
  • We’re going to need to support a number of use cases and integration patterns. This approach allows us to have multiple options on the table for each.
  • We want to be able to create a graph and effectively query it, but we also want to be able to do ad-hoc analytics and experimentation over the entire corpus of entities.

Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content: Presentation Transcript

  • Using a Hadoop Data Pipeline to Build a Graph of Users and Content
    Hadoop Summit - June 29, 2011
    Bill Graham
    bill.graham@cbs.com
  • About me
    Principal Software Engineer
    Technology, Business & News BU (TBN)
    TBN Platform Infrastructure Team
    Background in SW Systems Engineering and Integration Architecture
    Contributor: Pig, Hive, HBase
    Committer: Chukwa
  • About CBSi – who are we?
    ENTERTAINMENT
    GAMES & MOVIES
    SPORTS
    TECH, BIZ & NEWS
    MUSIC
  • About CBSi - scale
    Top 10 global web property
    235M worldwide monthly uniques [1]
    Hadoop Ecosystem
    CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading
    Cluster size:
    Currently workers: 35 DW + 6 TBN (150TB)
    Next quarter: 100 nodes (500TB)
    DW peak processing: 400M events/day globally
    1 - Source: comScore, March 2011
  • Abstract
    At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.
  • The Problem
    Users are always voting on what they find interesting
    Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc.
    Users have multiple identities
    Anonymous
    Registered (logged in)
    Social
    Multiple devices
    Connections between entities are in silo-ized sub-graphs
    Wealth of valuable user connectedness going unrealized
  • The Goal
    Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to:
    Content
    Authors
    Each other
    Themselves
    Better understand how our users connect to our content
    Improved content recommendations
    Improved user segmentation and content/ad targeting
  • Requirements
    Integrate with existing DW/BI Hadoop Infrastructure
    Aggregate data from across CBSi and beyond
    Connect disjointed user identities
    Flexible data model
    Assemble graph of relationships
    Enable rapid experimentation, data mining and hypothesis testing
    Power new site features and advertising optimizations
  • The Approach
    Mirror data into HBase
    Use MapReduce to process data
    Export RDF data into a triple store
  • Data Flow (diagram)
    Sources: Site Activity Stream a.k.a. Firehose (JMS), CMS Publishing, CMS Systems, Content Tagging Systems, Social/UGC Systems, DW Systems
    Real-time data arrives as atomic writes into HBase; batch data is staged in HDFS and bulk loaded
    MapReduce (Pig, ImportTsv) transforms & loads data and exports RDF to the Triple Store
    The Site queries the Triple Store via SPARQL
  • NOSQL Data Models (chart: data size vs. data complexity)
    Key-value stores, ColumnFamily, Document databases, Graph databases
    Credit: Emil Eifrem, Neo Technology
  • Conceptual Graph (diagram)
    Nodes: anonId, SessionId, regId, PageEvent, Brand, Asset, Author, Product, Story, tag
    Edges: is also, had session, contains, like, follow, authored by, tagged with
    Sources: Activity firehose (real-time), CMS (batch + incr.), Tags (batch), DW (daily)
  • HBase Schema
    user_info table
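    A rough sketch of what that schema implies, not taken from the deck: the speaker notes describe a single user_info table keyed by user id, with repeating (1..*) alias and event entries. Assuming those live in alias and event column families, they can be read from Pig as maps; the alias family name is an assumption, while event:* matches the script shown later.
    -- Hypothetical sketch: read the user_info table's alias and event column
    -- families as maps with Pig's HBaseStorage (the 'alias' family name is assumed).
    USER_INFO = LOAD 'hbase://user_info'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'alias:* event:*', '-loadKey true')
        AS (id:bytearray, alias_map:map[], event_map:map[]);
    -- Each row is one user id plus a map of aliases and a map of raw JSON events.
    DUMP USER_INFO;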
  • HBase Loading
    Incremental
    Consuming from a JMS queue == real-time
    Batch
    Pig’s HBaseStorage == quick to develop & iterate
    HBase’s ImportTsv == more efficient
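    To make the batch path concrete, here is a minimal, hypothetical sketch (not from the deck) of loading a TSV extract from HDFS and writing it into the user_info table with Pig's HBaseStorage; the input path and column names are illustrative assumptions.
    -- Hypothetical batch load: read a TSV extract from HDFS and write it into
    -- the user_info table. The first field becomes the HBase row key; the
    -- remaining fields map to the listed columns.
    EVENTS = LOAD '/data/dw/social_events.tsv' USING PigStorage('\t')
        AS (user_id:chararray, event_id:chararray, event_json:chararray);
    STORE EVENTS INTO 'hbase://user_info'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:id event:json');
    For the largest loads, the slide's ImportTsv option is the more efficient route, since it can generate HFiles for bulk loading instead of writing through the normal client path.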
  • Generating RDF with Pig
    RDF [1] is an XML standard to represent subject-predicate-object relationships
    Philosophy: Store large amounts of data in Hadoop, be selective of what goes into the triple store
    For example:
    “first class” graph citizens we plan to query on
    Implicit to explicit (i.e., derived) connections
    Content recommendations
    User segments
    Related users
    Content tags
    Easily join data to create new triples with Pig
    Run SPARQL [2] queries, examine, refine, reload
    1 - http://www.w3.org/RDF, 2 - http://www.w3.org/TR/rdf-sparql-query
  • Example Pig RDF Script
    Create RDF triples of users to social events:
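    -- Note: mapToBag, jsonToMap and GenerateRDFTriple below are CBSi-specific UDFs
    -- (the speaker notes mention a UDF that abstracts the RDF string construction);
    -- they would need to be REGISTERed/DEFINEd before the script runs.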
    RAW = LOAD 'hbase://user_info' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
    AS (id:bytearray, event_map:map[]);
    -- Convert our maps to bags so we can flatten them out
    A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);
    -- Convert the JSON events into maps
    B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];
    -- Pull values from map
    C = FOREACH B GENERATE id, social_map#'levt.asid' AS asid, social_map#'levt.xastid' AS astid, social_map#'levt.event' AS event, social_map#'levt.eventt' AS eventt, social_map#'levt.ssite' AS ssite, social_map#'levt.ts' AS eventtimestamp ;
    EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
    'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp ) ;
    STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();
  • Example SPARQL query
    Recommend content based on Facebook “liked” items:
    SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2 WHERE {
    # anon-user who Like'd a content asset (news item, blog post) on Facebook
    <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
    ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
    ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
    ?x <urn:com.cbs.trident:tasset> ?asset1 .
    ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .
    # a tag associated with the content asset
    ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
    ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
    # other content assets with the same tag and their title
    ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 . FILTER (?asset2 != ?asset1)
    ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
    ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
    ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 . FILTER
    (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
    } ORDER BY DESC (?pubdt2) LIMIT 10
  • Conclusions I - Power and Flexibility
    Architecture is flexible with respect to:
    Data modeling
    Integration patterns
    Data processing, querying techniques
    Multiple approaches for graph traversal
    SPARQL
    Traverse HBase
    MapReduce
  • Conclusions II – Match Tool with the Job
    Hadoop - scale and computing horsepower
    HBase – atomic r/w access, speed, flexibility
    RDF Triple Store – complex graph querying
    Pig – rapid MR prototyping and ad-hoc analysis
    Future:
    HCatalog – Schema & table management
    Oozie or Azkaban – Workflow engine
    Mahout – Machine learning
    Hama – Graph processing
  • Conclusions III – OSS, woot!
    If it doesn’t do what you want, submit a patch.
  • Questions?