EAP - Accelerating Behavioral Analytics at PayPal Using Hadoop
 


PayPal today generates massive amounts of data: from clickstream logs to transactions and routine business events. Analyzing customer behavior across this data can be a daunting task. The Data Technology team at PayPal has built a configurable engine, the Event Analytics Pipeline (EAP), using Hadoop to ingest and process massive amounts of customer interaction data, match business-defined behavioral patterns, and generate entities and interactions matching those patterns. The pipeline is an ecosystem of components built using HDFS, HBase, a data catalog, and seamless connectivity to enterprise data stores. EAP’s data definition, data processing, and behavioral analysis can be adapted to many business needs. Leveraging Hadoop to address the problems of size and scale, EAP promotes agility by abstracting the complexities of big-data technologies behind a set of tools and metadata that allow end users to control the behavior-centric processing of data. EAP abstracts the massive data stored on HDFS as business objects, e.g., customers and page-impression events, allowing analysts to easily extract patterns of events across billions of rows of data. The rules system built using HBase allows analysts to define relationships between entities and extrapolate them across disparate data sources to truly explore the universe of customer interactions and behaviors through a single lens.

  • Each company uses data in its own ways. Here are just some of the ways in which PayPal leverages its big data.
  • Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  • The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read the input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it’s a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw records to defined objects, and emits the analytical data for downstream logic. Our relationship system (called the “resolver” internally) is responsible for promoting keys across the data to effectively create joins at scale. A minimal sketch of this flow appears after these notes.
  • The processing subsystem is where we expect most usage. It’s where problem solvers across the company set up their own use cases using our modules that operate on the object library.
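
A minimal sketch of the object-factory idea described in the notes above, assuming a map-only Hadoop job that reads a pipe-delimited filestream, applies a simple raw-to-object mapping, and writes event records (keyed by Session ID) to sequence files. The class names, input layout, and JSON-ish output shape are illustrative assumptions, not the actual EAP code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ObjectFactoryJob {

      /** Maps one raw delimited record into one analytical event, keyed by Session ID. */
      public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Assumed raw layout: timestamp|eventName|sessionId|customerId
          String[] cols = line.toString().split("\\|", -1);
          if (cols.length < 4) {
            return; // skip malformed records
          }
          // Toy "mapping from raw to defined objects", shaped like the EAP event
          // example shown later in the transcript.
          String event = String.format(
              "{\"name\":\"%s\",\"timestamp\":\"%s\",\"entities\":{\"Session\":\"%s\",\"Customer\":\"%s\"}}",
              cols[1], cols[0], cols[2], cols[3]);
          context.write(new Text(cols[2]), new Text(event));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "eap-object-factory");
        job.setJarByClass(ObjectFactoryJob.class);
        job.setMapperClass(EventMapper.class);
        job.setNumReduceTasks(0);                                 // map-only ingest
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class); // events as sequence files
        FileInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }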

EAP - Accelerating Behavioral Analytics at PayPal Using Hadoop: Presentation Transcript

  • 27th June, 2013 - Accelerating Behavioral Analytics at PayPal @ Hadoop Summit 2013. DATA | PLATFORM - EVENT ANALYTICS PLATFORM
  • INTRODUCTION - Data Platform @PayPal: “Data Anywhere, Anytime, Anyplace!” Components: Engine, Workspaces, Modules, Access. Team: Rahul Bhartia (Product Owner), Alexei Vassiliev (Technical Lead).
  • A COMMON LANDSCAPE OF DATA
  • Current business needs: Risk - detecting and preventing fraud; Marketing - reaching customers with relevant offers; Product - improving user experience for better conversion; Merchant - providing insights into their businesses.
  • BEHAVIORAL ANALYTICS: DATA TO USE. From terabytes of raw data (clickstream, transactions & logs; millions of rows) to metadata for a business view (behaviors across channels; processors - flows & patterns): API Login, FRAUD Review, AUTH Confirm, TXN Shipment.
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - one common and extensible framework. (Diagram: sources, data parsers, events, developer.)
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. EAP Event example:

        {
          "id": "Impression384923561362839690709",
          "name": "Impression",
          "type": "Clickstream",
          "subtype": "Page",
          "filestream": "FPTI",
          "sourceDataHash": "38492356",
          "timestamp": "1362839690709",
          "creationTimestamp": "1363202683771",
          "updateTimestamp": "1363202715608",
          "attributes": { "attr1": "val1", "attr2": "val2" },
          "entities": { "Customer": "12346326326", "Session": "3521651326" }
        }
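
Later slides note that events use a common representation stored in sequence files. A minimal sketch of how an event shaped like the JSON above might be modeled as a Hadoop Writable for that purpose is shown below; the class, the field choices, and the serialization details are illustrative assumptions rather than the actual EAP implementation.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class Event implements Writable {
      private Text id = new Text();
      private Text name = new Text();
      private Text type = new Text();
      private long timestamp;
      private MapWritable attributes = new MapWritable(); // e.g. attr1 -> val1
      private MapWritable entities = new MapWritable();   // e.g. Customer -> 12346326326

      @Override
      public void write(DataOutput out) throws IOException {
        id.write(out);
        name.write(out);
        type.write(out);
        out.writeLong(timestamp);
        attributes.write(out);
        entities.write(out);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        id.readFields(in);
        name.readFields(in);
        type.readFields(in);
        timestamp = in.readLong();
        attributes.readFields(in);
        entities.readFields(in);
      }

      /** Convenience accessor: entity ID for a given entity type, e.g. "Customer". */
      public String getEntity(String entityType) {
        Writable v = entities.get(new Text(entityType));
        return v == null ? null : v.toString();
      }
    }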
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - one common and extensible framework; augment, not transform, with metadata. (Diagram: sources, data parsers, events, library, tags & relations, developer.)
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - one common and extensible framework; augment, not transform, with metadata; templates for analytical workflow. (Diagram: sources, data parsers, events, library, tags & relations, processors, module jobs, SQL, PathViz (D3), developer.)
  • BUILDING A WORKFLOW: FINDING PATTERNS
  • FINDING A USER WORKFLOW
  • DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT - one common and extensible framework; augment, not transform, with metadata; templates for analytical workflow. (Diagram: INPUT - sources, data parsers, events, library, tags & relations; PROCESSING - processors, module jobs, SQL, PathViz (D3); developer.)
  • EVENT ANALYTIC PLATFORM (EAP) - ARCHITECTURE
  • EVENT ANALYTIC PLATFORM (EAP). Data Ingest: metadata-driven transformations, plug-in parsers. Events: common representation as sequence files. Relations: pre-computed, map-side joins using HBase. Catalog: metadata-indexed HDFS repository. Modules: common interface for data access, simplified logic.
  • INPUT SUBSYSTEM – RAW DATA TO EVENTS. (Diagram: sources - delimited, sequence, others; MapReduce with mapping & entities definition; events in the Data Catalog on HDFS; HBase holding reference data and entity relations.)
  • EVENT ANALYTIC PLATFORM (EAP). Ingest: metadata-driven transformations, plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data, linking events across channels. Modules: logical expressions transparent to sources. Library: business metadata as tags.
  • ENRICHING THE DATA. The Entity Resolver looks up reference data and entity relations in HBase to fill in missing entity IDs:

        Before:
        Timestamp     | Event    | Session ID | Customer ID
        1362839690709 | pageview | 123456567  | ?
        1362839790719 | pageview | 123456567  | ?
        1362839890729 | pageview | 123456567  | 7654321

        After:
        Timestamp     | Event    | Session ID | Customer ID
        1362839690709 | pageview | 123456567  | 7654321
        1362839790710 | pageview | 123456567  | 7654321
        1362839890711 | pageview | 123456567  | 7654321
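
A minimal sketch of the lookup shown above, assuming a map-side enrichment step that resolves the missing Customer ID from an HBase table keyed by Session ID. The table name, column family/qualifier, and pipe-delimited record layout are hypothetical; only the general approach (an HBase Get per unresolved event) follows the slide.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EntityResolverMapper extends Mapper<LongWritable, Text, Text, Text> {

      private Connection connection;
      private Table relations; // assumed table: row = sessionId, family "r", qualifier "customer"

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        connection = ConnectionFactory.createConnection(conf);
        relations = connection.getTable(TableName.valueOf("entity_relations"));
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Assumed input layout, as in the tables above: timestamp|event|sessionId|customerId
        String[] cols = line.toString().split("\\|", -1);
        if (cols.length < 4) {
          return;
        }
        String customerId = cols[3];
        if (customerId.isEmpty() || "?".equals(customerId)) {
          // Map-side lookup: resolve the Customer ID from the Session ID.
          Result r = relations.get(new Get(Bytes.toBytes(cols[2])));
          byte[] resolved = r.getValue(Bytes.toBytes("r"), Bytes.toBytes("customer"));
          if (resolved != null) {
            customerId = Bytes.toString(resolved);
          }
        }
        context.write(new Text(cols[2]),
            new Text(cols[0] + "|" + cols[1] + "|" + cols[2] + "|" + customerId));
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        relations.close();
        connection.close();
      }
    }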
  • EVENT ANALYTIC PLATFORM (EAP). Data Ingest: metadata-driven transformations, plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data, linking events across channels. Catalog: indexed access to all the event data. Modules: common interface for data access, simplified logic.
  • DATA CATALOG: ACCESS TO THE EVENTS. (Diagram: Data Catalog API backed by HBase process metadata; sequence files on HDFS; accessed via MapReduce/Pig.)
  • ACCESSING DATA.

        Pig:
          REGISTER 'EventEngine.jar';
          EVENTDATA = LOAD 'eap://event'
            USING com.paypal.eap.EventLoader(Time, Source, Events, Attributes, Entities);
          ...

        MapReduce:
          Set<Path> paths = catalog.get(final Calendar startDate, final Calendar endDate, final String type)
          ...
          FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> {
  • EVENT ANALYTIC PLATFORM (EAP): SUMMARY. Data Ingest: metadata-driven transformations, plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data, linking across channels. Catalog: indexed access to all the event data. Modules: simplified logic for event processing.
  • PROCESSING SUBSYSTEM – EVENTS TO INFORMATION. (Diagram: library, module jobs - Path Discovery, Pattern Matching, Event Metrics; invoke, load; Data Catalog, HDFS, MR, Pig.)
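
As an illustration of what a Pattern Matching module job might do, here is a minimal sketch: a reducer that receives all events for one session (keyed by Session ID, with values as "timestamp|eventName" strings) and flags sessions containing the ordered flow API Login, AUTH Confirm, TXN Shipment from the earlier slide. The record layout, class name, and module interface are assumptions, not the actual EAP module API.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PatternMatchReducer extends Reducer<Text, Text, Text, Text> {

      // Ordered pattern to look for, echoing the flow on the earlier slide.
      private static final String[] PATTERN = {"API Login", "AUTH Confirm", "TXN Shipment"};

      @Override
      protected void reduce(Text sessionId, Iterable<Text> events, Context context)
          throws IOException, InterruptedException {
        // Collect and sort this session's events by timestamp.
        List<String[]> rows = new ArrayList<>();
        for (Text t : events) {
          String[] cols = t.toString().split("\\|", 2);   // timestamp|eventName
          if (cols.length == 2) {
            rows.add(cols);
          }
        }
        rows.sort((a, b) -> Long.compare(Long.parseLong(a[0]), Long.parseLong(b[0])));

        // Scan for the pattern as an ordered (not necessarily contiguous) subsequence.
        int next = 0;
        for (String[] row : rows) {
          if (next < PATTERN.length && PATTERN[next].equals(row[1])) {
            next++;
          }
        }
        if (next == PATTERN.length) {
          context.write(sessionId, new Text("matched:" + String.join(">", PATTERN)));
        }
      }
    }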
  • A QUICK LOOK: NUMBERS
  • EVENT ANALYTICS PLATFORM (EAP): METRICS. Cluster: exploratory cluster 600+ nodes, production cluster 600+ nodes. Data (current): daily (2) 300+ GB, hourly (1) 50+ GB, growing every day with more sources. Processing (sample): time 15 min, events 100+ M, entities (HBase) 20 M. Jobs (user): flows 5 min / 40+ M, extract 10 min / 200+ M.
  • THANK YOU