Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

Using a Hadoop Data Pipeline to Build a Graph of Users and Content Hadoop Summit - June 29, 2011 Bill Graham bill.graham@cbs.com

About me Principal Software Engineer Technology, Business & News BU (TBN) TBN Platform Infrastructure Team Background in SW Systems Engineering and Integration Architecture Contributor: Pig, Hive, HBase Committer: Chukwa

About CBSi – who are we? ENTERTAINMENT GAMES & MOVIES SPORTS TECH, BIZ & NEWS MUSIC

About CBSi - scale Top 10 global web property 235M worldwide monthly uniques1 Hadoop Ecosystem CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading Cluster size: Currently workers: 35 DW + 6 TBN (150TB) Next quarter: 100 nodes (500TB) DW peak processing: 400M events/day globally 1 - Source: comScore, March 2011

Abstract At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.

The Problem User always voting on what they find interesting Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc. Users have multiple identities Anonymous Registered (logged in) Social Multiple devices Connections between entities are in silo-ized sub-graphs Wealth of valuable user connectedness going unrealized

The Goal Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to: Content Authors Each other Themselves Better understand how our users connect to our content Improved content recommendations Improved user segmentation and content/ad targeting

Requirements Integrate with existing DW/BI Hadoop Infrastructure Aggregate data from across CBSi and beyond Connect disjointed user identities Flexible data model Assemble graph of relationships Enable rapid experimentation, data mining and hypothesis testing Power new site features and advertising optimizations

The Approach Mirror data into HBase Use MapReduce to process data Export RDF data into a triple store

Data Flow Site Triple Store SPARQL RDF CMS Publishing Site Activity Stream a.k.a. Firehose (JMS) HBase MapReduce ,[object Object]

ImportTsvatomic writes transform & load Social/UGC Systems DW Systems HDFS bulk load CMS Systems Content Tagging Systems

NOSQL Data Models Key-value stores ColumnFamily Document databases Graph databases Data size Data complexity Credit: Emil Eifrem, Neotechnology

Conceptual Graph PageEvent PageEvent contains contains Brand SessionId regId is also is also Asset had session follow Author anonId like is also Asset follow Asset Author is also authored by Product tagged with tagged with Story tag Activity firehose (real-time) CMS (batch + incr.) Tags (batch) DW (daily)

HBase Loading Incremental Consuming from a JMS queue == real-time Batch Pig’s HBaseStorage== quick to develop & iterate HBase’sImportTsv== more efficient

Generating RDF with Pig RDF1 is an XML standard to represent subject-predicate-object relationships Philosophy: Store large amounts of data in Hadoop, be selective of what goes into the triple store For example: “first class” graph citizens we plan to query on Implicit to explicit (i.e., derived) connections Content recommendations User segments Related users Content tags Easily join data to create new triples with Pig Run SPARQL2 queries, examine, refine, reload 1 - http://www.w3.org/RDF, 2 - http://www.w3.org/TR/rdf-sparql-query

Example Pig RDF Script Create RDF triples of users to social events: RAW = LOAD 'hbase://user_info' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true’) AS (id:bytearray, event_map:map[]); -- Convert our maps to bags so we can flatten them out A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v); -- Convert the JSON events into maps B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[]; -- Pull values from map C = FOREACH B GENERATE id, social_map#'levt.asid' AS asid, social_map#'levt.xastid' AS astid, social_map#'levt.event' AS event, social_map#'levt.eventt' AS eventt, social_map#'levt.ssite' AS ssite, social_map#'levt.ts' AS eventtimestamp ; EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple( 'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp ) ; STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage ();

Example SPARQL query Recommend content based on Facebook “liked” items: SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2 WHERE { # anon-user who Like'd a content asset (news item, blog post) on Facebook <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x . ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE” . ?x <urn:com.cbs.trident:ssite> "www.facebook.com" . ?x <urn:com.cbs.trident:tasset> ?asset1 . ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> . # a tag associated with the content asset ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 . ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname . # other content assets with the same tag and their title ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 . FILTER (?asset2 != ?asset1) ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname . ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 . ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 . FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>) } ORDER BY DESC (?pubdt2) LIMIT 10

Conclusions I - Power and Flexibility Architecture is flexible with respect to: Data modeling Integration patterns Data processing, querying techniques Multiple approaches for graph traversal SPARQL Traverse HBase MapReduce

Conclusions II – Match Tool with the Job Hadoop - scale and computing horsepower HBase – atomic r/w access, speed, flexibility RDF Triple Store – complex graph querying Pig – rapid MR prototyping and ad-hoc analysis Future: HCatalog – Schema & table management Oozie or Azkaban – Workflow engine Mahout – Machine learning Hama – Graph processing

Conclusions III – OSS, woot! If it doesn’t do what you want, submit a patch.

Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

Similar to Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content (20)

Recently uploaded

Recently uploaded (20)

Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content

Editor's Notes