Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.

Slide notes:
  • CBSi has a number of brands; this slide shows the biggest ones. I’m in the TBN group, and the work I’ll present is being done for CNET, with the intent to extend it horizontally.
  • We have a lot of traffic and data. We’ve been using Hadoop quite extensively for a few years now. 135/150TB currently, soon to be 500TB.
  • Summarize what I’ll discuss
  • We do a number of these items already, but in disparate systems.
  • Simplified overview of the approach. Details to be discussed on the next data flow slide.
  • Multiple data load options – bulk, real-time, incremental update. MapReduce to examine data. Export data to RDF in the triple store. Analysts and engineers can access HBase or MR to explore data. For now we’re using various triple stores for experimentation; we haven’t done a full evaluation yet. Technology for the triple store or graph store is still TBD.
  • The slope of this plot is subjective, but conceptually this is the case. HBase would be in the upper left quadrant and a graph store would be in the lower right. Our solution leverages the strength of each and we use MR to go from one to the other.
  • Just an example of a graph we can build. The graph can be adapted to meet use cases. An anonymous user has relationships to other identities, as well as assets that he/she interacts with. The graph is built from items from different data sources: blue=firehose, orange=CMS, green=tagging systems, red=DW
  • Simple schema. 1..* for both aliases and events.
  • The next few slides will walk through some specifics of the data flow. How do we get data into HBase? One of the nice things about HBase is that it supports a number of techniques to load data.
  • Once data is in HBase, we selectively build RDF relationships to store in the triple store. Pig allows for easy iteration.
  • One of our simpler scripts: six Pig statements to generate this set of RDF. We have a UDF to abstract out the RDF string construction.
  • Recommend the most recent blog content that is tagged with the same tags as the user’s FB like.
  • We’re going to need to support a number of use cases and integration patterns. This approach allows us to have multiple options on the table for each.
  • We want to be able to create a graph and effectively query it, but we also want to be able to do ad-hoc analytics and experimentation over the entire corpus of entities.

    1. Using a Hadoop Data Pipeline to Build a Graph of Users and Content
       Hadoop Summit - June 29, 2011
       Bill Graham
       bill.graham@cbs.com
    2. About me
       Principal Software Engineer
       Technology, Business & News BU (TBN)
       TBN Platform Infrastructure Team
       Background in SW Systems Engineering and Integration Architecture
       Contributor: Pig, Hive, HBase
       Committer: Chukwa
    3. About CBSi – who are we?
       ENTERTAINMENT | GAMES & MOVIES | SPORTS | TECH, BIZ & NEWS | MUSIC
    4. About CBSi - scale
       Top 10 global web property
       235M worldwide monthly uniques¹
       Hadoop ecosystem: CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading
       Cluster size:
         Current workers: 35 DW + 6 TBN (150TB)
         Next quarter: 100 nodes (500TB)
       DW peak processing: 400M events/day globally
       1 - Source: comScore, March 2011
    5. Abstract
       At CBSi we’re developing a scalable, flexible platform to provide the ability to
       aggregate large volumes of data, to mine it for meaningful relationships and to
       produce a graph of connected users and content. This will enable us to better
       understand the connections between our users, our assets, and our authors.
    6. The Problem
       Users are always voting on what they find interesting
         Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc.
       Users have multiple identities
         Anonymous
         Registered (logged in)
         Social
         Multiple devices
       Connections between entities are in silo-ized sub-graphs
       A wealth of valuable user connectedness is going unrealized
    7. The Goal
       Create a back-end platform that enables us to assemble a holistic graph of our
       users and their connections to:
         Content
         Authors
         Each other
         Themselves
       Better understand how our users connect to our content
       Improved content recommendations
       Improved user segmentation and content/ad targeting
    8. Requirements
       Integrate with existing DW/BI Hadoop infrastructure
       Aggregate data from across CBSi and beyond
       Connect disjointed user identities
       Flexible data model
       Assemble graph of relationships
       Enable rapid experimentation, data mining and hypothesis testing
       Power new site features and advertising optimizations
    9. The Approach
       Mirror data into HBase
       Use MapReduce to process data
       Export RDF data into a triple store
    10.-11. Data Flow
        [diagram] The Site Activity Stream (a.k.a. Firehose, JMS) feeds HBase via
        atomic writes; CMS Publishing, CMS Systems, Content Tagging Systems,
        Social/UGC Systems and DW Systems land in HDFS, then are transformed & loaded
        or bulk loaded into HBase (Pig, ImportTsv); MapReduce jobs process the HBase
        data and export RDF to the Triple Store, which the Site queries via SPARQL.
    12. NOSQL Data Models
        [diagram] Key-value stores, ColumnFamily stores, document databases and graph
        databases plotted by data size vs. data complexity: size decreases and
        complexity increases as you move from key-value to graph. HBase sits toward
        the large-size/simple end, a graph store toward the smaller-size/complex end.
        Credit: Emil Eifrem, Neo Technology
    13. Conceptual Graph
        [diagram] Example graph: an anonymous user (anonId) "is also" a SessionId, a
        regId and other identities; sessions contain PageEvents, which contain Assets;
        the user "likes" and "follows" Assets and Authors; Assets (Product, Story) are
        "authored by" Authors and "tagged with" tags, under a Brand. Edge sources:
        Activity firehose (real-time), CMS (batch + incr.), Tags (batch), DW (daily).
    14. HBase Schema
        user_info table
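        A minimal sketch of the likely layout, inferred from the Pig script on slide
        17 (row key = user id; an event:* column family of JSON payloads) and the
        "1..* for both aliases and events" note above; the alias family name is an
        assumption:

            row key: user id (e.g., anonId)
            alias:<alias type>  ->  linked identities, 1..*  (e.g., alias:regId)
            event:<event id>    ->  JSON event payload, 1..*  (read in Pig as event:*)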
    15. HBase Loading
        Incremental
          Consuming from a JMS queue == real-time
        Batch
          Pig’s HBaseStorage == quick to develop & iterate
          HBase’s ImportTsv == more efficient
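        As a sketch of the HBaseStorage batch path (the HDFS path, field names and
        target column are hypothetical; user_info and the event family come from
        slide 17), a bulk write can be as short as:

            -- Load a TSV of (user id, JSON event payload) from HDFS; the first
            -- field becomes the HBase row key, the second the event:levt column.
            events = LOAD '/data/firehose/current' USING PigStorage('\t')
                     AS (user_id:chararray, event_json:chararray);
            STORE events INTO 'hbase://user_info'
                USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:levt');

        The ImportTsv alternative is a MapReduce job shipped with HBase
        (org.apache.hadoop.hbase.mapreduce.ImportTsv) that can write puts directly or
        generate HFiles for bulk load, which is why it is the more efficient option.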
    16. Generating RDF with Pig
        RDF¹ is a W3C standard for representing subject-predicate-object relationships
        Philosophy: store large amounts of data in Hadoop; be selective about what
        goes into the triple store. For example:
          “first class” graph citizens we plan to query on
          Implicit made explicit (i.e., derived) connections:
            Content recommendations
            User segments
            Related users
            Content tags
        Easily join data to create new triples with Pig (see the sketch below)
        Run SPARQL² queries, examine, refine, reload
        1 - http://www.w3.org/RDF, 2 - http://www.w3.org/TR/rdf-sparql-query
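        A minimal sketch of that join step, assuming hypothetical paths and field
        names: deriving explicit user-to-tag connections from LIKE events and
        content tags.

            -- Join user LIKE events with content tags to derive explicit
            -- user-interested-in-tag relationships for the triple store.
            EVENTS = LOAD 'trident/events' USING PigStorage()
                     AS (id:chararray, asid:chararray, event:chararray);
            TAGS   = LOAD 'trident/tags' USING PigStorage()
                     AS (asset_id:chararray, tagname:chararray);
            LIKES  = FILTER EVENTS BY event == 'LIKE';
            J      = JOIN LIKES BY asid, TAGS BY asset_id;
            USER_TAG = FOREACH J GENERATE LIKES::id AS id, TAGS::tagname AS tagname;
            STORE USER_TAG INTO 'trident/rdf/out/user_tag' USING PigStorage();

        In practice the resulting pairs would be run through the same
        triple-generating UDF shown on the next slide.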
    17. Example Pig RDF Script
        Create RDF triples of users to social events:

        -- mapToBag, jsonToMap and GenerateRDFTriple are in-house UDFs
        -- (REGISTER statements omitted on the slide)
        RAW = LOAD 'hbase://user_info'
              USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
              AS (id:bytearray, event_map:map[]);
        -- Convert our maps to bags so we can flatten them out
        A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);
        -- Convert the JSON events into maps
        B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];
        -- Pull values from the map
        C = FOREACH B GENERATE id,
            social_map#'levt.asid'   AS asid,
            social_map#'levt.xastid' AS astid,
            social_map#'levt.event'  AS event,
            social_map#'levt.eventt' AS eventt,
            social_map#'levt.ssite'  AS ssite,
            social_map#'levt.ts'     AS eventtimestamp;
        EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
            'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp);
        STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();
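        The UDF’s output format isn’t shown on the slide; judging from the URIs in
        the SPARQL query on the next slide, the stored triples presumably look
        something like this (the event node URI here is invented for illustration):

            <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ>
                <urn:com.cbs.trident:event:LIKE> <urn:com.cbs.trident:evt:1234> .
            <urn:com.cbs.trident:evt:1234>
                <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
            <urn:com.cbs.trident:evt:1234>
                <urn:com.cbs.trident:ssite> "www.facebook.com" .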
    18. Example SPARQL query
        Recommend content based on Facebook “liked” items:

        SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2 WHERE {
          # anon-user who Like'd a content asset (news item, blog post) on Facebook
          <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
          ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
          ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
          ?x <urn:com.cbs.trident:tasset> ?asset1 .
          ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .
          # a tag associated with the content asset
          ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
          ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
          # other content assets with the same tag and their title
          ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 . FILTER (?asset2 != ?asset1)
          ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
          ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
          ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 .
          FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
        } ORDER BY DESC(?pubdt2) LIMIT 10
    19. Conclusions I - Power and Flexibility
        Architecture is flexible with respect to:
          Data modeling
          Integration patterns
          Data processing and querying techniques
        Multiple approaches for graph traversal:
          SPARQL
          Traverse HBase
          MapReduce
    20. Conclusions II – Match Tool with the Job
        Hadoop – scale and computing horsepower
        HBase – atomic r/w access, speed, flexibility
        RDF Triple Store – complex graph querying
        Pig – rapid MR prototyping and ad-hoc analysis
        Future:
          HCatalog – schema & table management
          Oozie or Azkaban – workflow engine
          Mahout – machine learning
          Hama – graph processing
    21. Conclusions III – OSS, woot!
        If it doesn’t do what you want, submit a patch.
    22. Questions?