PAYPAL - BEHAVIORAL TRACKING ON HADOOP

ANIL MADAN
DIRECTOR OF ENGINEERING, MARKETING & ANALYTICS
PAYPAL'S VISION




Delivering the future of money today…
An essential part of our customers' financial and business
lives, enabling secure commerce anywhere, anytime, any way

110 million active accounts, 190 markets, 25 currencies
BEHAVIORAL TRACKING VISION
Understand our customers' behavior and experience anytime, anywhere, any way
to drive desirable outcomes for our customers and for PayPal.

•  Ensure privacy, security and trust for our customers
•  Ensure instrumentation standardization across channels
•  Enable self-service analytics for our product and marketing teams
TRACKING PLATFORM OVERVIEW


Channels
•  Direct/Home Page
•  Transaction Emails
•  Email Marketing
•  Display Advertising
•  Search Engine Marketing

Metadata
•  Tracking Metadata Tool
•  Tag Catalog
•  Taxonomy

Tracking Servers
•  Tracking Event Service
•  Tracking Validation Service

Real Time Systems
•  Marketing
•  Segmentation
•  Experimentation

Big Data
•  Reporting/Visualization
•  Digital Metrics
•  Attribution
METADATA - ENTITY MODEL

[Diagram] Entities: LAYOUT, PAGE, ELEMENTS, LINK, COMPONENTS
METADATA - EVENT MODEL


Tracking Event
•  Impression Event
        •  Component Impression Event
        •  Page Impression Event
                •  Client Page Impression Event
                •  Server Page Impression Event
        •  Ad Impression Event
•  Reaction Event
        •  Click Event
        •  Click-Through Event
        •  Mouse-over Event
•  Conversion Event

Entry Event and Exit Event (lowest level of the diagram, drawn beneath the reaction events)
ATTRIBUTION MODEL

Channel                Impression    Impression    Click    Open
                       (Client)      (Server)
Direct                     ✓             ✓
Organic Search             ✓
Paid Search                                          ✓
Display Offers                           ✓           ✓
Onsite Offers                            ✓           ✓
Transactional Emails                                 ✓        ✓
Marketing Emails                                     ✓        ✓
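One way to read the attribution table is as a lookup from channel to the event types that count toward attribution for that channel. The following is a minimal sketch in Python, not PayPal's implementation: the event-type strings (`client_impression`, `server_impression`, `click`, `open`) and the helper `is_attributable` are hypothetical names chosen for illustration.

```python
# Event types checked per channel, transcribed from the attribution table.
# The string identifiers are illustrative assumptions, not platform names.
ATTRIBUTION_EVENTS = {
    "Direct": {"client_impression", "server_impression"},
    "Organic Search": {"client_impression"},
    "Paid Search": {"click"},
    "Display Offers": {"server_impression", "click"},
    "Onsite Offers": {"server_impression", "click"},
    "Transactional Emails": {"click", "open"},
    "Marketing Emails": {"click", "open"},
}


def is_attributable(channel: str, event_type: str) -> bool:
    """Return True if the table has a check mark for this channel and event type."""
    return event_type in ATTRIBUTION_EVENTS.get(channel, set())


if __name__ == "__main__":
    print(is_attributable("Paid Search", "click"))
    print(is_attributable("Paid Search", "open"))
```

Keeping the table as data rather than branching logic means a new channel only needs a new dictionary entry.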
LOGICAL ARCHITECTURE
Onsite Channels
•  Web Tracking JS
•  Mobile Instrumentation API
•  Search Engine Marketing Instrumentation

Marketing Channels
•  Social Marketing
•  Email Marketing
•  Onsite Marketing
•  Display Advertising

Metadata components
•  Metadata Tool
•  Tracking Metadata Service
•  Metadata Repository
•  Tag Catalog

Tracking components
•  Tracking Servers
•  Tracking Service
•  Active MQ Producer
•  Active MQ
•  Active MQ Consumer (two instances)
•  Tracking Collector
•  NAS Filer (two instances)
•  Tracking Batch
•  Aggregation/Compression

Message Delivery Services
•  Segmentation Service
•  Marketing Offers Service

Hadoop Cluster
•  Customer Intelligence
•  Operational Metrics
•  Behavioral Intelligence
•  Reporting
•  Sessionization
•  Bot Flagging
•  Identity Mapping
DATA INGEST PIPELINE

PRE-PROCESS
•  Input: Raw Event (Gzip Text), Raw Event (Gzip Text)
•  Map Reduce: Validate/Dedup Events → Deduped Event (Gzip block compressed SequenceFile)
•  Map Reduce: Join Client & Server Events → Enriched Event (Gzip block compressed SequenceFile)

SESSIONIZATION
•  Input: Enriched Event
•  Map Reduce: Sessionization
•  CHAIN REDUCER: Mapper Geo Lookup (uses Geo Data) → Mapper Bot Flagging (uses Bot Data/Rules)
•  Output: Sessions

METRICS GENERATION
•  Sessions → Map Reduce Stage 1 → Map Reduce Stage 2 → Behavioral Metrics → Reporting MySQL
•  Enriched Event → Pig → Adhoc Metrics
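The PRE-PROCESS stage validates and dedups raw events, then joins client and server events into an enriched event. The slides do not say which fields drive either step, so the sketch below is plain Python (not MapReduce) built on assumptions: the field names `event_id`, `visitor_id`, `timestamp`, `page_view_id` and `source` are hypothetical, as are the choices to dedup on `event_id` and to join on `page_view_id`.

```python
from collections import defaultdict

# Assumed schema: none of these field names come from the slides.
REQUIRED_FIELDS = ("event_id", "visitor_id", "timestamp", "page_view_id", "source")


def validate_and_dedup(raw_events):
    """Drop events missing a required field, then keep the first event per event_id."""
    seen = set()
    for event in raw_events:
        if any(event.get(field) is None for field in REQUIRED_FIELDS):
            continue  # fails validation
        if event["event_id"] in seen:
            continue  # duplicate delivery of an event already emitted
        seen.add(event["event_id"])
        yield event


def join_client_server(events):
    """Merge the client and server events that share a (hypothetical) page_view_id."""
    sides_by_key = defaultdict(dict)
    for event in events:
        sides_by_key[event["page_view_id"]][event["source"]] = event
    for sides in sides_by_key.values():
        # Server fields first, then client fields, so client values win on conflict.
        enriched = {**sides.get("server", {}), **sides.get("client", {})}
        enriched.pop("source", None)
        enriched["sources"] = sorted(sides)
        yield enriched
```

In a MapReduce job the `seen` set and the `sides_by_key` dictionary would be replaced by the shuffle: the dedup or join key becomes the map output key, and each reducer call sees all records for one key.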
SESSIONIZATION
Events

Visitor ID   Session ID   Timestamp          Event Payload
V1           S1           2012-05-24 05:12   E1
V2           S2           2012-05-24 05:14   E2
V1           S1           2012-05-24 05:15   E3
V1           S1           2012-05-24 05:20   E4
V2           S2           2012-05-24 05:21   E5
V1           S3           2012-05-24 07:25   E6
V1           S3           2012-05-24 07:26   E7

VisitContainer

Visitor ID   Session ID   Payload
V1           S1           ie, winnt, {flash, quicktime}, {ca, usa}, 480 secs,….
                          E1, E3, E4
V2           S2           ff, winxp, {acrobat, mediaplayer}, {wb, in}, 420 secs…..
                          E2, E5
V1           S3           sf, macos, {quicktime, java}, {on, ca}, 60 secs
                          E6, E7

•  Chronologically sort events using secondary sort
        •  SortComparator on visitorid, sessionid and timestamp
        •  Partitioner & Grouping comparator on visitorid and sessionid
•  Normalize data and store it against the session record
        •  Browser, os, plugins, geo-location, duration, bot-flag etc.
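The secondary-sort idea in the bullets above can be simulated outside Hadoop. The sketch below is plain Python rather than a MapReduce job: `sorted` stands in for the sort comparator and `itertools.groupby` for the grouping comparator. The function name `sessionize` is hypothetical; the input rows are the seven events from the Events table.

```python
from itertools import groupby
from operator import itemgetter

# (visitor_id, session_id, timestamp, payload) rows from the Events table.
EVENTS = [
    ("V1", "S1", "2012-05-24 05:12", "E1"),
    ("V2", "S2", "2012-05-24 05:14", "E2"),
    ("V1", "S1", "2012-05-24 05:15", "E3"),
    ("V1", "S1", "2012-05-24 05:20", "E4"),
    ("V2", "S2", "2012-05-24 05:21", "E5"),
    ("V1", "S3", "2012-05-24 07:25", "E6"),
    ("V1", "S3", "2012-05-24 07:26", "E7"),
]


def sessionize(events):
    """Group events into (visitor, session, [payloads in time order])."""
    # Sort comparator: visitor id, session id, then timestamp.
    # These zero-padded timestamp strings sort the same way as the times they encode.
    ordered = sorted(events, key=itemgetter(0, 1, 2))
    # Grouping comparator: visitor id and session id only, so each group
    # receives its events already in chronological order.
    containers = []
    for (visitor, session), group in groupby(ordered, key=itemgetter(0, 1)):
        containers.append((visitor, session, [event[3] for event in group]))
    return containers


if __name__ == "__main__":
    for container in sessionize(EVENTS):
        print(container)
```

In Hadoop the partitioner plays a third role that this single-process sketch does not need: it routes every event for a given visitor and session to the same reducer, so the reducer never has to buffer and re-sort a session in memory.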
DIMENSIONS & METRICS

Dimension
•  Page
•  PageFlow
•  Country
•  CountryRegion
•  Plugins
•  VisitDepth
•  VisitDuration
•  VisitByHour
•  SearchEngine
•  OS
•  Browser

Metrics
•  Visitors
•  Sessions
•  Bounce Rate
•  Page Views

Time Period
•  Hourly
•  Daily
•  Weekly
•  Monthly
METRICS GENERATION
STAGE 1: Compute sessions sorted by visitor, dimension (browser)

Mapper Input                          Mapper Output
Visitor ID   Session ID   Browser     Key (visitorid, browser)   Value (#sessions)
V1           S1           IE          V1,IE                      1
V1           S2           IE          V1,IE                      1
V2           S3           FF          V2,FF                      1
V3           S4           IE          V3,IE                      1
V4           S5           FF          V4,FF                      1

Reducer Output
Key (visitorid, browser)   Value (#sessions)
V1,IE                      2
V2,FF                      1
V3,IE                      1
V4,FF                      1

STAGE 2: Compute metrics by dimension

Mapper Input                                   Mapper Output
Key (visitorid, browser)   Value (#sessions)   Key (browser)   Value (#sessions, #visitors)
V1,IE                      2                   IE              2,1
V2,FF                      1                   FF              1,1
V3,IE                      1                   IE              1,1
V4,FF                      1                   FF              1,1

Reducer Output
Key (browser)   Value (#sessions, #visitors)
IE              3,2
FF              2,2
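The two-stage computation can be traced in plain Python. This is a sketch of the technique rather than Hadoop code: the function names `stage1` and `stage2` are hypothetical, each function folds its map and reduce steps into one loop, and the input is the five Stage 1 mapper input rows (visitor ID, session ID, browser).

```python
from collections import defaultdict

# Stage 1 mapper input rows: (visitor_id, session_id, browser).
SESSIONS = [
    ("V1", "S1", "IE"),
    ("V1", "S2", "IE"),
    ("V2", "S3", "FF"),
    ("V3", "S4", "IE"),
    ("V4", "S5", "FF"),
]


def stage1(sessions):
    """Count sessions per (visitor, browser) key."""
    counts = defaultdict(int)
    for visitor, _session, browser in sessions:
        # Map emits ((visitor, browser), 1); reduce sums the ones per key.
        counts[(visitor, browser)] += 1
    return dict(counts)


def stage2(sessions_per_visitor):
    """Roll stage 1 output up to (#sessions, #visitors) per browser."""
    totals = defaultdict(lambda: [0, 0])
    for (_visitor, browser), n_sessions in sessions_per_visitor.items():
        # Map emits (browser, (n_sessions, 1)); reduce sums both components.
        totals[browser][0] += n_sessions
        totals[browser][1] += 1
    return {browser: tuple(pair) for browser, pair in totals.items()}


if __name__ == "__main__":
    per_visitor = stage1(SESSIONS)
    print(per_visitor)
    print(stage2(per_visitor))
```

The split into two stages exists because visitors is a distinct count. Stage 1 collapses the data to one row per visitor and dimension value, so stage 2 can count each visitor exactly once by adding 1 per row.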
PIG – ADHOC QUERIES
/* EventLoader - custom loader ; Exposes correct data-types using metadata for each field*/

grunt> data = LOAD '/paypal/event' USING
>> com.paypal.EventLoader(
>> 'visitor_id, session_id, page_name, event_type, event_timestamp');

grunt> describe data;
data: {visitor_id: chararray, session_id: chararray, page_name: chararray,
event_type: chararray, event_timestamp: long }

grunt> events = FILTER data BY event_timestamp >= 1337583600000L and
event_timestamp < 1337587200000L;

grunt> grouped = group events by (page_name, event_type) parallel 20;
grunt> result = foreach grouped {
>>      visitors = distinct events.visitor_id;
>>      sessions = distinct events.session_id;
>>      generate group, COUNT(visitors), COUNT(sessions), COUNT(events);
>> };

grunt> dump result;
((My Account Overview, im), 117875L,119343L,230216L)
((mktg:xsell:merchant::home-inside, im), 462L,466L,655L)
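The two literals in the FILTER clause are timestamps in milliseconds since the Unix epoch. A small sketch of how such bounds can be produced and decoded follows; the helper names `to_epoch_millis` and `from_epoch_millis` are hypothetical, and the example works in UTC because the slides do not state which time zone the window was chosen in.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Lower and upper bounds used in the FILTER clause of the query.
START_MS = 1337583600000
END_MS = 1337587200000

if __name__ == "__main__":
    print(from_epoch_millis(START_MS).isoformat())
    print(from_epoch_millis(END_MS).isoformat())
    print((END_MS - START_MS) // 60000, "minutes")
```

Decoded this way, the bounds are 2012-05-21 07:00 and 08:00 UTC, so the query aggregates a one-hour window with an inclusive start and an exclusive end.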
PIG – ADHOC QUERIES
/* VisitContainerLoader custom loader - Tuple ( Tuple, Bag (Tuple) )*/

grunt> data = LOAD '/paypal/visitcontainer'
>> USING com.paypal.VisitContainerLoader(
>> '{"visit":["visitor_id","session_id","session_start", "session_end", "browser_type"],
"events":["page_name", "event_type"]}');

grunt> describe data;
data: {visit: (visitor_id: chararray, session_id: chararray, session_start: long, session_end:
long, browser_type: chararray),
        events: {event: (page_name: chararray, event_type: chararray)}}

grunt> flattened = foreach data generate FLATTEN(visit), FLATTEN(events);
grunt> impression = FILTER flattened BY event_type == 'im' and session_start >=
1339045200000L and session_end < 1339063200000L;
grunt> grouped = group impression by (page_name, browser_type) parallel 20;
grunt> result = foreach grouped {
>> visitors = distinct impression.visitor_id;
>> sessions = distinct impression.session_id;
>> generate group, COUNT(visitors), COUNT(sessions), COUNT(impression);
>> };

grunt> dump result;
((Account History:Request Money Details, chrome), 522L,528L,726L)
((Account History:Request Money Details, msie), 706L,716L,967L)
REPORTING




THANK YOU


We Are Hiring!
•  San Jose
•  Boston
•  Bangalore
•  Shanghai