PayPal Behavioral Analytics on Hadoop

Published in: Technology, Business

  1. PAYPAL - BEHAVIORAL TRACKING ON HADOOP. Anil Madan, Director of Engineering, Marketing & Analytics
  2. PAYPAL'S VISION: Delivering the future of money today… An essential part of our customers' financial and business lives, enabling secure commerce anywhere, anytime, any way. 110 million active accounts, 190 markets, 25 currencies.
  3. BEHAVIORAL TRACKING VISION
     •  Understand our customer's behavior and experience anytime, anywhere, any way to drive desirable outcomes for our customers and for PayPal.
     •  Enable self-service analytics for our product and marketing teams.
     •  Ensure instrumentation standardization across channels.
     •  Ensure privacy, security and trust for our customers.
  4. TRACKING PLATFORM OVERVIEW (diagram)
     •  Channels: Direct/Home Page, Transaction Emails, Email Marketing, Display Advertising, Search Engine Marketing
     •  Platform: Tracking Servers, Real Time Systems, Big Data
     •  Metadata: Tracking Metadata Service, Tracking Event Validation, Tracking Taxonomy, Tag Catalog, Marketing Tool
     •  Applications: Segmentation Service, Experimentation, Reporting/Visualization, Digital Metrics, Attribution
  5. METADATA - ENTITY MODEL: LAYOUT PAGE ELEMENTS LINK COMPONENTS
  6. METADATA - EVENT MODEL
     Tracking Event
     •  Impression Event: Component Impression Event, Page Impression Event (Client Page Impression Event, Server Page Impression Event), Ad Impression Event
     •  Reaction Event: Click Event, Click-Through Event, Mouse-over Event
     •  Conversion Event
     The lowest level of the hierarchy also shows: Entry Event, Exit Event
  7. ATTRIBUTION MODEL
     Columns: Channel, Impression, Click, Open; sub-columns: Client, Server
     Direct ✓ ✓
     Organic Search ✓
     Paid Search ✓
     Display Offers ✓ ✓
     Onsite Offers ✓ ✓
     Transactional Emails ✓ ✓
     Marketing Emails ✓ ✓
  8. LOGICAL ARCHITECTURE (diagram)
     •  Onsite Channels: Mobile, Web; Onsite Instrumentation (Tracking JS, API)
     •  Marketing Channels: Search Engine Marketing, Display Advertising, Social Marketing, Email Marketing; Marketing Instrumentation
     •  Services: Tracking Servers, Tracking Metadata Service, Marketing Offers Service, Segmentation Service, Message Delivery
     •  Transport: Active MQ Producer, Active MQ Consumers, Tracking Collector, NAS Filer
     •  Hadoop Cluster batch jobs: Aggregation/Compression, Sessionization, Bot Flagging, Identity Mapping
     •  Outputs: Customer Intelligence, Operational Metrics, Behavioral Intelligence, Reporting; Metadata Repository, Tag Catalog
  9. DATA INGEST PIPELINE
     PRE-PROCESS
     •  Raw events (gzip text) → Map Reduce: validate/dedup → deduped events (gzip block-compressed SequenceFile)
     •  Deduped events → Map Reduce: join client & server events → enriched events (gzip block-compressed SequenceFile)
     SESSIONIZATION
     •  Enriched events → Map Reduce with chain reducer: sessionization, then geo lookup mapper (geo data) and bot flagging mapper (bot rules) → enriched sessions
     METRICS GENERATION
     •  Enriched sessions → Map Reduce stage 1 → Map Reduce stage 2 → behavioral metrics → reporting (MySQL)
     •  Enriched events → Pig → adhoc metrics
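The validate/dedup step of the pre-process stage can be illustrated with a minimal Python sketch. The slides do not say which fields are validated or what key identifies a duplicate, so the `REQUIRED` field list and the `event_id` key below are assumptions made purely for illustration; the real job runs as Map Reduce, not as an in-memory loop.

```python
# Hypothetical required fields; the slides do not list the validation rules.
REQUIRED = ("event_id", "visitor_id", "event_timestamp")


def validate_and_dedup(raw_events):
    """Yield each valid event once, in arrival order.

    An event is dropped if any required field is missing, or if its
    event_id (an assumed dedup key) has already been seen.
    """
    seen = set()
    for event in raw_events:
        if any(event.get(field) is None for field in REQUIRED):
            continue  # fails validation
        if event["event_id"] in seen:
            continue  # duplicate delivery of the same event
        seen.add(event["event_id"])
        yield event


raw = [
    {"event_id": "e1", "visitor_id": "v1", "event_timestamp": 1},
    {"event_id": "e1", "visitor_id": "v1", "event_timestamp": 1},  # duplicate
    {"event_id": "e2", "visitor_id": None, "event_timestamp": 2},  # invalid
    {"event_id": "e3", "visitor_id": "v2", "event_timestamp": 3},
]
clean = list(validate_and_dedup(raw))
```

In a Map Reduce setting the same effect is usually obtained by emitting the dedup key from the mapper and keeping one record per key in the reducer, which avoids holding a `seen` set in memory.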
  10. SESSIONIZATION
      Events (Visitor ID, Session ID, Timestamp, Event Payload):
      V1  S1  2012-05-24 05:12  E1
      V2  S2  2012-05-24 05:14  E2
      V1  S1  2012-05-24 05:15  E3
      V1  S1  2012-05-24 05:20  E4
      V2  S2  2012-05-24 05:21  E5
      V1  S3  2012-05-24 07:25  E6
      V1  S3  2012-05-24 07:26  E7
      VisitContainer (Visitor ID, Session ID, Payload, Events):
      V1  S1  ie, winnt, {flash, quicktime}, {ca, usa}, 480 secs, …  E1, E3, E4
      V2  S2  ff, winxp, {acrobat, mediaplayer}, {wb, in}, 420 secs, …  E2, E5
      V1  S3  sf, macos, {quicktime, java}, {on, ca}, 60 secs  E6, E7
      •  Chronologically sort events using secondary sort
      •  SortComparator on visitorid, sessionid and timestamp
      •  Partitioner & Grouping comparator on visitorid and sessionid
      •  Normalize data and store it against the session record
      •  Browser, os, plugins, geo-location, duration, bot-flag etc.
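The secondary-sort idea on this slide can be simulated in plain Python. This is a sketch, not the Hadoop job itself: `sorted` with the full composite key stands in for the SortComparator, and `groupby` on (visitor, session) stands in for the Partitioner and Grouping comparator. The function name `sessionize` and the output dictionary layout are illustrative assumptions; the event rows are the ones shown on the slide.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

FMT = "%Y-%m-%d %H:%M"

# (visitor_id, session_id, timestamp, event) rows from the slide.
events = [
    ("V1", "S1", "2012-05-24 05:12", "E1"),
    ("V2", "S2", "2012-05-24 05:14", "E2"),
    ("V1", "S1", "2012-05-24 05:15", "E3"),
    ("V1", "S1", "2012-05-24 05:20", "E4"),
    ("V2", "S2", "2012-05-24 05:21", "E5"),
    ("V1", "S3", "2012-05-24 07:25", "E6"),
    ("V1", "S3", "2012-05-24 07:26", "E7"),
]


def sessionize(rows):
    # Sort on the full composite key: visitor, session, then timestamp.
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    containers = {}
    # Group on (visitor, session) only, so each group arrives time-ordered.
    for key, group in groupby(ordered, key=itemgetter(0, 1)):
        group = list(group)
        start = datetime.strptime(group[0][2], FMT)
        end = datetime.strptime(group[-1][2], FMT)
        containers[key] = {
            "events": [row[3] for row in group],
            "duration_secs": int((end - start).total_seconds()),
        }
    return containers


visits = sessionize(events)
```

Because the events of a session reach the grouping step already in time order, the duration falls out of the first and last timestamps without buffering or re-sorting inside the group, which is the point of using a secondary sort.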
  11. DIMENSIONS & METRICS
      •  Dimension: Page, PageFlow, Country, CountryRegion, Plugins, VisitDepth, VisitDuration, VisitByHour, SearchEngine, OS, Browser
      •  Metrics: Visitors, Sessions, Bounce Rate, Page Views
      •  Time Period: Hourly, Daily, Weekly, Monthly
  12. METRICS GENERATION
      STAGE 1: compute sessions sorted by visitor, dimension (browser)
      •  Mapper input (Visitor ID, Session ID, Browser): V1 S1 IE; V1 S2 IE; V2 S3 FF; V3 S4 IE; V4 S5 FF
      •  Mapper output, key (visitorid, browser), value (#sessions): V1,IE 1; V1,IE 1; V2,FF 1; V3,IE 1; V4,FF 1
      •  Reducer output, key (visitorid, browser), value (#sessions): V1,IE 2; V2,FF 1; V3,IE 1; V4,FF 1
      STAGE 2: compute metrics by dimension
      •  Mapper input, key (visitorid, browser), value (#sessions): V1,IE 2; V2,FF 1; V3,IE 1; V4,FF 1
      •  Mapper output, key (browser), value (#sessions, #visitors): IE 2,1; IE 1,1; FF 1,1; IE 1,1
      •  Reducer output, key (browser), value (#sessions, #visitors): IE 4,3; FF 1,1
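The two-stage aggregation can be sketched in a few lines of Python. This simulates the map and reduce steps in memory and is not the actual Map Reduce code; the function names `stage1` and `stage2` and the sample rows (with made-up visitors and browsers) are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical session rows: (visitor_id, session_id, browser).
sessions = [
    ("v1", "s1", "chrome"),
    ("v1", "s2", "chrome"),
    ("v2", "s3", "safari"),
    ("v3", "s4", "chrome"),
]


def stage1(rows):
    """Stage 1: emit ((visitor, dimension), 1) per session, then sum per key."""
    counts = Counter()
    for visitor_id, _session_id, dimension in rows:
        counts[(visitor_id, dimension)] += 1
    return dict(counts)


def stage2(per_visitor):
    """Stage 2: re-key by dimension; each stage 1 record is exactly one visitor."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [#sessions, #visitors]
    for (_visitor_id, dimension), n_sessions in per_visitor.items():
        totals[dimension][0] += n_sessions
        totals[dimension][1] += 1
    return {dim: tuple(pair) for dim, pair in totals.items()}


metrics = stage2(stage1(sessions))
```

Splitting the work this way turns the distinct-visitor count into a plain sum: after stage 1 every (visitor, dimension) pair appears exactly once, so stage 2 never has to hold a set of visitor ids per dimension.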
  13. PIG – ADHOC QUERIES
      /* EventLoader - custom loader; exposes correct data types using metadata for each field */
      grunt> data = LOAD '/paypal/event' USING
      >> com.paypal.EventLoader(
      >> 'visitor_id, session_id, page_name, event_type, event_timestamp');
      grunt> describe data;
      data: {visitor_id: chararray, session_id: chararray, page_name: chararray, event_type: chararray, event_timestamp: long}
      grunt> events = FILTER data BY event_timestamp >= 1337583600000L and event_timestamp < 1337587200000L;
      grunt> grouped = group events by (page_name, event_type) parallel 20;
      grunt> result = foreach grouped {
      >> visitors = distinct events.visitor_id;
      >> sessions = distinct events.session_id;
      >> generate group, COUNT(visitors), COUNT(sessions), COUNT(events);
      >> };
      grunt> dump result;
      ((My Account Overview, im), 117875L,119343L,230216L)
      ((mktg:xsell:merchant::home-inside, im), 462L,466L,655L)
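For readers less familiar with Pig, the logic of this query (filter to a one-hour window, group by page and event type, count distinct visitors, distinct sessions and total events) can be mirrored in plain Python. This is an illustrative equivalent under assumed in-memory data, not part of the original deck; the `summarize` function and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarize(events, start_ms, end_ms):
    """Return {(page_name, event_type): (#visitors, #sessions, #events)}."""
    groups = defaultdict(lambda: {"visitors": set(), "sessions": set(), "events": 0})
    for e in events:
        # Same half-open window as the FILTER: start <= timestamp < end.
        if not (start_ms <= e["event_timestamp"] < end_ms):
            continue
        g = groups[(e["page_name"], e["event_type"])]
        g["visitors"].add(e["visitor_id"])
        g["sessions"].add(e["session_id"])
        g["events"] += 1
    return {
        key: (len(g["visitors"]), len(g["sessions"]), g["events"])
        for key, g in groups.items()
    }


def row(visitor_id, session_id, page_name, event_type, ts):
    return {
        "visitor_id": visitor_id,
        "session_id": session_id,
        "page_name": page_name,
        "event_type": event_type,
        "event_timestamp": ts,
    }


# Hypothetical sample rows; the last one falls on the excluded upper bound.
sample = [
    row("v1", "s1", "Home", "im", 1337583600000),
    row("v1", "s1", "Home", "im", 1337583660000),
    row("v2", "s2", "Home", "im", 1337585000000),
    row("v2", "s2", "Home", "im", 1337587200000),
]
result = summarize(sample, 1337583600000, 1337587200000)
```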
  14. PIG – ADHOC QUERIES
      /* VisitContainerLoader custom loader - Tuple ( Tuple, Bag (Tuple) ) */
      grunt> data = LOAD '/paypal/visitcontainer'
      >> USING com.paypal.VisitContainerLoader(
      >> '{"visit":["visitor_id","session_id","session_start","session_end","browser_type"],"events":["page_name","event_type"]}');
      grunt> describe data;
      data: {visit: (visitor_id: chararray, session_id: chararray, session_start: long, session_end: long, browser_type: chararray), events: {event: (page_name: chararray, event_type: chararray)}}
      grunt> flattened = foreach data generate FLATTEN(visit), FLATTEN(events);
      grunt> impression = FILTER flattened BY event_type == 'im' and session_start >= 1339045200000L and session_end < 1339063200000L;
      grunt> grouped = group impression by (page_name, browser_type) parallel 20;
      grunt> result = foreach grouped {
      >> visitors = distinct impression.visitor_id;
      >> sessions = distinct impression.session_id;
      >> generate group, COUNT(visitors), COUNT(sessions), COUNT(impression);
      >> };
      grunt> dump result;
      ((Account History:Request Money Details, chrome), 522L,528L,726L)
      ((Account History:Request Money Details, msie), 706L,716L,967L)
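The effect of `FLATTEN(visit), FLATTEN(events)` on the nested Tuple (Tuple, Bag (Tuple)) layout is easiest to see on a tiny example. The Python sketch below is only an analogy for what Pig does: each visit record is repeated once per event in its bag, and the impression filter is then applied to the flat rows. The sample containers and the `cl` event type are hypothetical.

```python
def flatten(containers):
    """Yield one flat row per (visit, event) pair; empty bags yield nothing."""
    for visit, events in containers:
        for event in events:
            yield {**visit, **event}


def impressions_in_window(rows, start_ms, end_ms):
    """Keep impression rows whose session lies inside [start_ms, end_ms)."""
    return [
        r for r in rows
        if r["event_type"] == "im"
        and r["session_start"] >= start_ms
        and r["session_end"] < end_ms
    ]


# Hypothetical visit containers: (visit fields, bag of events).
containers = [
    (
        {"visitor_id": "v1", "session_id": "s1", "session_start": 1339045200000,
         "session_end": 1339045800000, "browser_type": "chrome"},
        [{"page_name": "Home", "event_type": "im"},
         {"page_name": "Home", "event_type": "cl"}],
    ),
    (
        # Starts before the window, so every row of this session is filtered out.
        {"visitor_id": "v2", "session_id": "s2", "session_start": 1339040000000,
         "session_end": 1339046000000, "browser_type": "msie"},
        [{"page_name": "Home", "event_type": "im"}],
    ),
]
rows = list(flatten(containers))
kept = impressions_in_window(rows, 1339045200000, 1339063200000)
```

Storing session-level fields once per container and flattening only at query time keeps the stored data compact while still allowing event-level filters such as `event_type == 'im'`.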
  15. REPORTING
  16. THANK YOU. We Are Hiring!
      •  San Jose
      •  Boston
      •  Bangalore
      •  Shanghai
  17. Sessions will resume at 4:30pm
