EAP - Accelerating behavioral analytics at PayPal using Hadoop

PayPal today generates massive amounts of data, from clickstream logs to transactions and routine business events. Analyzing customer behavior across this data can be a daunting task. The Data Technology team at PayPal has built a configurable engine, the Event Analytics Pipeline (EAP), using Hadoop to ingest and process massive amounts of customer interaction data, match business-defined behavioral patterns, and generate entities and interactions matching those patterns. The pipeline is an ecosystem of components built on HDFS, HBase, a data catalog, and seamless connectivity to enterprise data stores. EAP's data definition, data processing, and behavioral analysis can be adapted to many business needs. Leveraging Hadoop to address the problems of size and scale, EAP promotes agility by abstracting the complexities of big-data technologies behind a set of tools and metadata that let end users control the behavior-centric processing of data. EAP abstracts the massive data stored on HDFS as business objects, e.g., customers and page-impression events, allowing analysts to easily extract patterns of events across billions of rows of data. The rules system built on HBase allows analysts to define relationships between entities and extrapolate them across disparate data sources, exploring the universe of customer interactions and behaviors through a single lens.


Speaker notes
  • Each company uses data in its own ways. Here are just some of the ways in which PayPal leverages its big data.
  • Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  • The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it's a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw to defined objects, and emits the analytical data for downstream logic (a minimal sketch follows these notes). Our relationship system (called the “resolver” internally) is responsible for promoting keys out across the data to effectively create joins at scale.
  • The processing subsystem is where we expect most usage. It's where problem solvers across the company set up their own use cases using our modules that operate on the object library.
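
A minimal sketch of the object-factory job described in these notes, assuming a hypothetical tab-delimited clickstream layout; the class name, column positions, and the emitted JSON shape are illustrative, not PayPal's actual code (the real job is metadata-driven and writes events to sequence files):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Scheduled MapReduce "object factory": reads raw delimited records,
    // applies a raw-to-object mapping, and emits EAP-style events for
    // downstream logic.
    public class ObjectFactoryMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      // Raw-to-event mapping. In EAP this comes from metadata; it is
      // hard-coded here for brevity.
      private static final int TIMESTAMP_COL = 0;
      private static final int EVENT_NAME_COL = 1;
      private static final int SESSION_COL = 2;

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] cols = line.toString().split("\t", -1);
        if (cols.length <= SESSION_COL) {
          context.getCounter("eap", "malformed").increment(1);
          return; // skip rows that do not match the declared layout
        }
        // Emit a minimal JSON event in the shape shown on slide 8.
        String event = String.format(
            "{\"name\":\"%s\",\"timestamp\":\"%s\",\"entities\":{\"Session\":\"%s\"}}",
            cols[EVENT_NAME_COL], cols[TIMESTAMP_COL], cols[SESSION_COL]);
        context.write(NullWritable.get(), new Text(event));
      }
    }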
Transcript: EAP - Accelerating behavioral analytics at PayPal using Hadoop

    1. 27th June, 2013 - Accelerating Behavioral Analytics at PayPal @ Hadoop Summit 2013 - DATA | PLATFORM - EVENT ANALYTICS PLATFORM
    2. INTRODUCTION - Data Platform @PayPal. Data Anywhere, Anytime, Anyplace! Components: Engine, Workspaces, Modules, Access. Team: Rahul Bhartia (Product Owner), Alexei Vassiliev (Technical Lead).
    3. A COMMON LANDSCAPE OF DATA
    4. Current business needs: Risk - detecting and preventing fraud; Marketing - reaching customers with relevant offers; Product - improving user experience for better conversion; Merchant - providing insights into their businesses.
    5. BEHAVIORAL ANALYTICS: DATA TO USE - From terabytes of raw data (clickstream, transactions & logs; millions of rows) to metadata for a business view (behaviors across channels; processors: flows & patterns). Example flow: API Login, FRAUD Review, AUTH Confirm, TXN Shipment.
    6. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - One common and extensible framework. (Diagram: SOURCES feed DATA PARSERS, which produce EVENTS for the DEVELOPER.)
    7. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT (diagram only)
    8. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - EAP Event example:

       {
         "id": "Impression384923561362839690709",
         "name": "Impression",
         "type": "Clickstream",
         "subtype": "Page",
         "filestream": "FPTI",
         "sourceDataHash": "38492356",
         "timestamp": "1362839690709",
         "creationTimestamp": "1363202683771",
         "updateTimestamp": "1363202715608",
         "attributes": { "attr1": "val1", "attr2": "val2" },
         "entities": { "Customer": "12346326326", "Session": "3521651326" }
       }
    9. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - One common and extensible framework; augment, not transform, with metadata. (Diagram adds a LIBRARY of TAGS & RELATIONS alongside SOURCES, DATA PARSERS, EVENTS, and the DEVELOPER.)
    10. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT (diagram only)
    11. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - One common and extensible framework; augment, not transform, with metadata; templates for analytical workflow. (Diagram adds PROCESSORS and MODULE JOBS - SQL, PathViz (D3) - to the previous components.)
    12. BUILDING A WORKFLOW: FINDING PATTERNS
    13. FINDING A USER WORKFLOW
    14. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - One common and extensible framework; augment, not transform, with metadata; templates for analytical workflow. (Diagram splits the components into an INPUT side - SOURCES, DATA PARSERS, EVENTS, LIBRARY of TAGS & RELATIONS - and a PROCESSING side - PROCESSORS, MODULE JOBS with SQL and PathViz (D3).)
    15. EVENT ANALYTIC PLATFORM (EAP) - ARCHITECTURE
    16. EVENT ANALYTIC PLATFORM (EAP) - Data Ingest: metadata-driven transformations, plug-in parsers. Events: common representation as sequence files. Relations: pre-computed, map-side joins using HBase. Catalog: metadata-indexed HDFS repository. Modules: common interface for data access, simplified logic.
    17. INPUT SUBSYSTEM - RAW DATA TO EVENTS: sources (delimited, sequence, others) are parsed by a MapReduce job driven by mapping and entity definitions into events registered in the Data Catalog on HDFS; HBase holds the reference data and entity relations.
    18. EVENT ANALYTIC PLATFORM (EAP) - Ingest: metadata-driven transformations, plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data, linking events across channels. Modules: logical expressions transparent to sources. Library: business metadata as tags.
    19. ENRICHING THE DATA - The Entity Resolver looks up entity relations (reference data) in HBase to fill in missing entity keys; a sketch in the same spirit follows the transcript. Before:

       Timestamp      Event     Session ID  Customer ID
       1362839690709  pageview  123456567   ?
       1362839790719  pageview  123456567   ?
       1362839890729  pageview  123456567   7654321

    After:

       Timestamp      Event     Session ID  Customer ID
       1362839690709  pageview  123456567   7654321
       1362839790710  pageview  123456567   7654321
       1362839890711  pageview  123456567   7654321
    20. EVENT ANALYTIC PLATFORM (EAP) - Data Ingest: metadata-driven transformations, plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data, linking events across channels. Catalog: indexed access to all the event data. Modules: common interface for data access, simplified logic.
    21. DATA CATALOG: ACCESS TO THE EVENTS - The Data Catalog API serves process metadata from HBase and points MapReduce/Pig jobs at the event sequence files on HDFS.
    22. ACCESSING DATA
        Pig:
          REGISTER 'EventEngine.jar';
          EVENTDATA = LOAD 'eap://event'
            USING com.paypal.eap.EventLoader('Time', 'Source', 'Events', 'Attributes', 'Entities');
          ....
        MapReduce:
          // Catalog API: Set<Path> get(Calendar startDate, Calendar endDate, String type)
          Set<Path> paths = catalog.get(startDate, endDate, type);
          ....
          class FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> { ... }
    23. EVENT ANALYTIC PLATFORM (EAP): SUMMARY - Data Ingest: metadata-driven transformations, plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data, linking across channels. Catalog: indexed access to all the event data. Modules: simplified logic for event processing.
    24. PROCESSING SUBSYSTEM - EVENTS TO INFORMATION: module jobs (Path Discovery, Pattern Matching, Event Metrics) are invoked as MapReduce/Pig jobs over the Data Catalog and HDFS, and their results are loaded into the Library; a minimal pattern-matching sketch follows the transcript.
    25. A QUICK LOOK: NUMBERS
    26. EVENT ANALYTICS PLATFORM (EAP): METRICS - Cluster: exploratory 600+ nodes, production 600+ nodes. Data (current): daily (2): 300+ GB; hourly (1): 50+ GB; growing every day with more sources. Processing (sample): time 15 min, events 100+ M, entities (HBase) 20 M. Jobs (user): flows 5 min / 40+ M; extract 10 min / 200+ M.
    27. THANK YOU
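
To make the enrichment step on slide 19 concrete, here is a hedged sketch of an entity-resolver pass: a map-side lookup that fills a missing Customer ID from a Session ID via an HBase reference table. The table name ("entity_relations"), column family, and record layout are assumptions for illustration, and the sketch uses the current HBase client API rather than whatever the 2013 system used:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-side join against HBase: promotes the Customer ID across all
    // rows of a session, as on slide 19.
    public class EntityResolverMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      private Connection connection;
      private Table relations; // session-id -> customer-id reference data

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        connection = ConnectionFactory.createConnection(conf);
        relations = connection.getTable(TableName.valueOf("entity_relations"));
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Input columns: timestamp, event, sessionId, customerId ("?" if unknown).
        String[] cols = line.toString().split("\t", -1);
        if (cols.length == 4 && "?".equals(cols[3])) {
          Result r = relations.get(new Get(Bytes.toBytes(cols[2])));
          byte[] customer = r.getValue(Bytes.toBytes("e"), Bytes.toBytes("customer"));
          if (customer != null) {
            cols[3] = Bytes.toString(customer); // promote the key across the session
          }
        }
        context.write(NullWritable.get(), new Text(String.join("\t", cols)));
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        relations.close();
        connection.close();
      }
    }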
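
Finally, a minimal sketch of the pattern-matching module from slide 24, assuming a session's events arrive as a time-ordered list of event names (in practice this would run in a reducer keyed by session or customer); the pattern is the hypothetical example flow from slide 5:

    import java.util.List;

    // Matches a business-defined flow as a subsequence of a session's
    // time-ordered events; other events may occur in between.
    public final class FlowMatcher {

      // Illustrative pattern taken from slide 5:
      // API Login -> FRAUD Review -> AUTH Confirm -> TXN Shipment.
      private static final String[] PATTERN =
          {"API Login", "FRAUD Review", "AUTH Confirm", "TXN Shipment"};

      public static boolean matches(List<String> orderedEventNames) {
        int next = 0;
        for (String name : orderedEventNames) {
          if (name.equals(PATTERN[next]) && ++next == PATTERN.length) {
            return true; // every step of the flow was seen, in order
          }
        }
        return false;
      }
    }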
