EAP - Accelerating Behavioral Analytics at PayPal Using Hadoop

PayPal today generates massive amounts of data, from clickstream logs to transactions and routine business events. Analyzing customer behavior across this data can be a daunting task. The Data Technology team at PayPal has built a configurable engine, the Event Analytics Pipeline (EAP), using Hadoop to ingest and process massive amounts of customer interaction data, match business-defined behavioral patterns, and generate the entities and interactions matching those patterns. The pipeline is an ecosystem of components built using HDFS, HBase, and a data catalog, with seamless connectivity to enterprise data stores. EAP's data definition, data processing, and behavioral analysis can be adapted to many business needs. Leveraging Hadoop to address the problems of size and scale, EAP promotes agility by abstracting the complexities of big-data technologies behind a set of tools and metadata that let end users control the behavior-centric processing of data. EAP presents the massive data stored on HDFS as business objects, e.g., customer and page impression events, allowing analysts to easily extract patterns of events across billions of rows of data. The rules system, built using HBase, allows analysts to define relationships between entities and extrapolate them across disparate data sources, so that customer interactions and behaviors can be explored through a single lens.

Published in: Technology, Education

  1. Accelerating Behavioral Analytics at PayPal @Hadoop Summit 2013 (27th June, 2013). DATA | PLATFORM - EVENT ANALYTICS PLATFORM
  2. INTRODUCTION (Confidential and Proprietary). Data Platform @PayPal: Data Anywhere, Anytime, Anyplace! Components • Engine • Workspaces • Modules • Access. Team • Rahul Bhartia: Product Owner • Alexei Vassiliev: Technical Lead
  3. A COMMON LANDSCAPE OF DATA
  4. Current business needs… • Risk: Detecting and preventing fraud • Marketing: Reaching customers with relevant offers • Product: Improving user experience for better conversion • Merchant: Providing insights into their businesses
  5. BEHAVIORAL ANALYTICS: DATA TO USE. From terabytes of raw data (clickstream, transactions & logs; millions of rows) to metadata for a business view (behaviors across channels). Processors: Flows & Patterns • API Login • FRAUD Review • AUTH Confirm • TXN Shipment
  6. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES, DEVELOPER, EVENTS, DATA PARSERS. Principles • One common and extensible framework
  7. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
  8. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
     EAP Event Example:
     {
       "id": "Impression384923561362839690709",
       "name": "Impression",
       "type": "Clickstream",
       "subtype": "Page",
       "filestream": "FPTI",
       "sourceDataHash": "38492356",
       "timestamp": "1362839690709",
       "creationTimestamp": "1363202683771",
       "updateTimestamp": "1363202715608",
       "attributes": { "attr1": "val1", "attr2": "val2" },
       "entities": { "Customer": "12346326326", "Session": "3521651326" }
     }
  9. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES, DEVELOPER, EVENTS, LIBRARY, TAGS & RELATIONS, DATA PARSERS. Principles • One common and extensible framework • Augment, not transform, with metadata
  10. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
  11. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES, DEVELOPER, EVENTS, MODULE, JOBS, PathViz (D3), LIBRARY, TAGS & RELATIONS, DATA PARSERS, PROCESSORS, SQL. Principles • One common and extensible framework • Augment, not transform, with metadata • Templates for analytical workflow
  12. BUILDING A WORKFLOW: FINDING PATTERNS
  13. FINDING A USER WORKFLOW
  14. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES, DEVELOPER, EVENTS, MODULE, JOBS, PathViz (D3), LIBRARY, TAGS & RELATIONS, INPUT, PROCESSING, DATA PARSERS, PROCESSORS, SQL. Principles • One common and extensible framework • Augment, not transform, with metadata • Templates for analytical workflow
  15. EVENT ANALYTICS PLATFORM (EAP) - ARCHITECTURE
  16. EVENT ANALYTICS PLATFORM (EAP). Data Ingest • Metadata-driven transformations • Plug-in parsers. Events • Common representation as sequence files. Relations • Pre-computed • Map-side joins using HBase. Catalog • Metadata-indexed HDFS repository. Modules • Common interface for data access • Simplified logic
  17. INPUT SUBSYSTEM - RAW DATA TO EVENTS. Diagram labels: SOURCES, EVENTS, DELIMITED, SEQUENCE, OTHERS, Mapping & Entities Definition, MapReduce, Data Catalog, HDFS, HBase, Reference Data, Entity Relations
  18. EVENT ANALYTICS PLATFORM (EAP). Ingest • Metadata-driven transformations • Plug-in parsers. Events • Common representation stored in sequence files. Relations • Enriching the data • Linking events across channels. Modules • Logical expressions transparent to sources. Library • Business metadata as tags
  19. ENRICHING THE DATA
      Before enrichment:
        Timestamp      Event     Session ID  Customer ID
        1362839690709  pageview  123456567   ?
        1362839790719  pageview  123456567   ?
        1362839890729  pageview  123456567   7654321
      After enrichment:
        Timestamp      Event     Session ID  Customer ID
        1362839690709  pageview  123456567   7654321
        1362839790710  pageview  123456567   7654321
        1362839890711  pageview  123456567   7654321
      Diagram labels: Reference Data, Entity Relations, Lookup, HBASE, Entity Resolver
  20. EVENT ANALYTICS PLATFORM (EAP). Data Ingest • Metadata-driven transformations • Plug-in parsers. Events • Common representation stored in sequence files. Relations • Enriching the data • Linking events across channels. Catalog • Indexed access to all the Event data. Modules • Common interface for data access • Simplified logic
  21. DATA CATALOG: ACCESS TO THE EVENTS. Diagram labels: Data Catalog API, HBASE, Sequence Files, MapReduce/Pig, HDFS, PROCESS, METADATA
  22. ACCESSING DATA
      • PIG
          REGISTER 'EventEngine.jar';
          EVENTDATA = LOAD 'eap://event' USING com.paypal.eap.EventLoader(Time, Source, Events, Attributes, Entities);
          ….
      • MR
          Set<Path> paths = catalog.get(final Calendar startDate, final Calendar endDate, final String type)
          ….
          FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> {
  23. EVENT ANALYTICS PLATFORM (EAP): SUMMARY. Data Ingest • Metadata-driven transformations • Plug-in parsers. Events • Common representation stored in sequence files. Relations • Enriching the data • Linking across channels. Catalog • Indexed access to all the Event data. Modules • Simplified logic for Event processing
  24. PROCESSING SUBSYSTEM - EVENTS TO INFORMATION. Diagram labels: MODULE, JOBS, LIBRARY, Path Discovery, Pattern Matching, Event Metrics, Invoke, Load, Data Catalog, HDFS, MR, PIG
  25. A QUICK LOOK: NUMBERS
  26. EVENT ANALYTICS PLATFORM (EAP): METRICS
      Cluster • Exploratory cluster: 600+ nodes • Production cluster: 600+ nodes
      Data (Current) • Daily (2): 300+ GB • Hourly (1): 50+ GB • Growing every day with more sources
      Processing (Sample) • Time: 15 min • Events: 100+ M • Entity (HBase): 20M
      Jobs (User) • Flows: 5 min/40+ M • Extract: 10 min/200+ M
  27. THANK YOU
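
The event example on slide 8 and the enrichment step on slide 19 can be illustrated together. What follows is a minimal sketch, not the EAP implementation: the class names `Event`, `EntityResolver`, and `EnrichmentDemo` are hypothetical, only a few of the event fields are modeled, and a plain in-memory `HashMap` stands in for the HBase-backed entity relations lookup. It shows the idea from the slide: once one event in a session carries a customer ID, the other events in that session can be given the same ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of an EAP event; field names follow the slide 8 example.
final class Event {
    final String name;
    final long timestamp;
    final Map<String, String> attributes = new HashMap<>();
    final Map<String, String> entities = new HashMap<>();

    Event(String name, long timestamp) {
        this.name = name;
        this.timestamp = timestamp;
    }
}

// Assumption: a HashMap stands in for the HBase entity relations table.
final class EntityResolver {
    private final Map<String, String> sessionToCustomer = new HashMap<>();

    // Record a Session -> Customer relation from every event that carries both entities.
    void learn(List<Event> events) {
        for (Event e : events) {
            String session = e.entities.get("Session");
            String customer = e.entities.get("Customer");
            if (session != null && customer != null) {
                sessionToCustomer.put(session, customer);
            }
        }
    }

    // Add the Customer entity to events that lack it but belong to a known session.
    void enrich(List<Event> events) {
        for (Event e : events) {
            String session = e.entities.get("Session");
            if (session != null && !e.entities.containsKey("Customer")) {
                String customer = sessionToCustomer.get(session);
                if (customer != null) {
                    e.entities.put("Customer", customer);
                }
            }
        }
    }
}

final class EnrichmentDemo {
    // Builds the three page views from slide 19; only the last one carries a customer ID.
    static List<Event> sampleEvents() {
        long[] timestamps = {1362839690709L, 1362839790719L, 1362839890729L};
        List<Event> events = new ArrayList<>();
        for (long ts : timestamps) {
            Event e = new Event("pageview", ts);
            e.entities.put("Session", "123456567");
            events.add(e);
        }
        events.get(2).entities.put("Customer", "7654321");
        return events;
    }

    public static void main(String[] args) {
        List<Event> events = sampleEvents();
        EntityResolver resolver = new EntityResolver();
        resolver.learn(events);
        resolver.enrich(events);
        for (Event e : events) {
            System.out.println(e.timestamp + " " + e.name + " " + e.entities);
        }
    }
}
```

The two-pass shape (learn relations, then enrich) mirrors the "pre-computed relations" bullet on slide 16. In a MapReduce setting the lookup would presumably happen map-side against HBase rather than against a local map.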

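Slides 12, 13, and 24 mention finding user workflows and pattern matching without showing how a match is decided. The sketch below is one possible reading, offered as an assumption rather than as the deck's algorithm: events are grouped by session, ordered by timestamp into a flow, and a pattern matches when its steps appear in that order with other events allowed in between. `FlowEvent` and `FlowMatcher` are hypothetical names, and the step names are taken from the flow labels on slide 5.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal event: just enough to build per-session flows.
final class FlowEvent {
    final String sessionId;
    final String name;
    final long timestamp;

    FlowEvent(String sessionId, String name, long timestamp) {
        this.sessionId = sessionId;
        this.name = name;
        this.timestamp = timestamp;
    }
}

final class FlowMatcher {
    // Group events by session and order each group by timestamp: one flow per session.
    static Map<String, List<String>> flows(List<FlowEvent> events) {
        List<FlowEvent> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong((FlowEvent e) -> e.timestamp));
        Map<String, List<String>> bySession = new TreeMap<>();
        for (FlowEvent e : sorted) {
            bySession.computeIfAbsent(e.sessionId, k -> new ArrayList<>()).add(e.name);
        }
        return bySession;
    }

    // Assumed semantics: the pattern's steps must occur in order; gaps are allowed.
    static boolean matches(List<String> flow, List<String> pattern) {
        int next = 0;
        for (String step : flow) {
            if (next < pattern.size() && step.equals(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    // Session IDs whose flow matches the pattern, in sorted session-ID order.
    static List<String> matchingSessions(List<FlowEvent> events, List<String> pattern) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : flows(events).entrySet()) {
            if (matches(entry.getValue(), pattern)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

For example, with the pattern Login, Confirm, Shipment, a session whose flow is Login, Review, Confirm, Shipment matches, while a session with only Login and Shipment does not. A stricter variant could require the steps to be adjacent; the slides do not say which semantics EAP uses.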