Hadoop Graph Processing with Apache Giraph

June, 2013
Jay Tang
GRAPH MINING WITH APACHE GIRAPH

• Introduction
• Big Data problem
• Graph mining platform
• Use case
• Lessons
• Future work
AGENDA

• Director of Big Data Platform & Analytics, PayPal
− Hadoop, Graph mining, Real-time analytics, ML, text mining
• 20 years of software experience in the valley focused on data
• Member of original Hadoop team @Yahoo
• Built data warehouse, relational database, and OLAP products @Yahoo, Oracle/Hyperion, IBM Informix, DB2
ABOUT ME

BIG DATA PROBLEM

• Enable Online, Offline, and Mobile payment
• 128M customers worldwide
• $160B payment volume processed annually
• Major retail locations accepting PayPal: 20K today → 2M by end of 2013
• PayPal Here launching in US and international markets
Petabyte Data Problem & Growing
BIG DATA PROBLEM @ PAYPAL

• Detect and prevent fraud
• Assess credit risk
• Deliver relevant offers to our customers
• Improve user experience
• Provide better insights to our merchants
BIG DATA POWERS PAYPAL ANALYTICS

GRAPH MINING PLATFORM

BIG DATA STACK
Data
Cloud

Traditional data processing abstraction -- TABLE
• Rows
• Columns
• Data Types
DATA ABSTRACTION

• Internet & WWW
• Social network
• PayPal payment network – accounts & transactions
GRAPH IS EVERYWHERE

• Think like a vertex
• Two basic operations
− Fusion: aggregate information from neighbors to a set of entities
− Diffusion: propagate information from a vertex to neighbors
GRAPH COMPUTING
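
The two operations above can be illustrated with a minimal sketch in plain Python. This is not the Giraph API; the toy graph, the per-vertex values, and the choice of summation as the aggregation are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy graph: vertex -> list of neighbor vertices.
graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
}
# Hypothetical per-vertex values to spread through the graph.
value = {"A": 1.0, "B": 2.0, "C": 3.0}


def diffuse(graph, value):
    """Diffusion: every vertex sends its value to each of its neighbors."""
    inbox = defaultdict(list)
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(value[vertex])
    return inbox


def fuse(graph, inbox):
    """Fusion: every vertex aggregates (here, sums) what it received."""
    return {vertex: sum(inbox.get(vertex, [])) for vertex in graph}


inbox = diffuse(graph, value)
fused = fuse(graph, inbox)
```

Here `fused["A"]` combines the values sent by B and C, while B and C each receive only the value sent by A.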

THINK LIKE A VERTEX - FUSION

THINK LIKE A VERTEX - DIFFUSION

• Which graph mining engine to use?
− GraphLab
− Apache Giraph
− Apache Hama
• Hadoop compatible
− Data is on Hadoop
− Leverage existing cluster infrastructure
− Integration with Hadoop
• Ease of deployment and update
• Community
GRAPH MINING ENGINE

• Apache open source implementation of Google Pregel on Hadoop
• A vertex can send messages to any other vertex
• In-memory scalable system
− Map-only jobs, Zookeeper, Netty
BSP & GIRAPH

GRAPH MINING USE CASE

• Stop fraudsters from stealing money from PayPal payment network
• Sophisticated risk models running in real-time based on
− Online data
− Offline data
• Risk profile traditionally based on a variety of data
− Account
− Transaction -- frequency, amount, history
− IP
− Email domain
RISK DETECTION & MITIGATION

RISK COMPUTATION
Current TX Details
Risk Models
Approve
Decline
History Data

• PayPal data are connected
• Form multiple communities that have hidden inferences
• Discover the inferences via a graph approach
• Build a system to extract the inferences
GRAPH MINING CONNECTED DATA

GRAPH VIEW OF DATA
User1
User2
Merchant
BUY
BUY
P2P Money Transfer

GRAPH VIEW OF DATA
Account 1
IP1 IP2
Account 2
IP3

GRAPH MINING DATA PIPELINE
Pre-Processing → Graph Processing → Post-Processing

• Input data is raw transaction data
• Custom MapReduce jobs to pre-process data into graph model
• Output is an adjacency list in JSON format
− Easy to consume in Java and by humans
− Use gson library
• Post processing – output format conversion
GRAPH DATA PIPELINE
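
The pre-processing step can be sketched as follows. The slides describe MapReduce jobs in Java using gson; this sketch swaps in Python's standard `json` module and runs in a single process. The record fields and the `[vertex_id, vertex_value, [[target_id, edge_value], ...]]` line layout are assumptions for illustration, not the format confirmed by the slides.

```python
import json
from collections import defaultdict

# Hypothetical raw transaction records: (sender, receiver, amount).
transactions = [
    ("acct1", "acct2", 25.0),
    ("acct1", "acct3", 10.0),
    ("acct2", "acct3", 5.0),
]


def to_adjacency_lines(transactions, initial_value=0.0):
    """Group edges by source vertex and emit one JSON line per vertex."""
    edges = defaultdict(list)
    for sender, receiver, amount in transactions:
        edges[sender].append([receiver, amount])
        edges.setdefault(receiver, [])  # keep vertices with no outgoing edges
    return [
        json.dumps([vertex, initial_value, edges[vertex]])
        for vertex in sorted(edges)
    ]


lines = to_adjacency_lines(transactions)
for line in lines:
    print(line)
```

One line per vertex keeps the output splittable across input readers and easy to inspect by eye.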

• Customers/Accounts linked via transactions
• Compute risk = intrinsic risk + risk propagated from peers
• Send risk message to peers
• Iterate until convergence
GRAPH PROCESSING
Cus1
Cus2
Transaction T1
Transaction T0
Transaction T2
Transaction T3
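
The iteration described on this slide can be sketched in plain Python as a sequence of synchronous supersteps. The slides do not give the actual risk formula, so the update rule here (intrinsic risk plus a weighted mean of the peers' risk), the `weight`, the tolerance, and the sample data are all assumptions.

```python
def propagate_risk(peers, intrinsic, weight=0.5, tol=1e-6, max_supersteps=100):
    """Iterate risk = intrinsic + weight * mean(peer risk) until it settles.

    peers: vertex -> list of linked vertices (hypothetical structure).
    Returns the final risk per vertex and the number of supersteps run.
    """
    risk = dict(intrinsic)
    for superstep in range(1, max_supersteps + 1):
        # Message phase: every vertex sends its current risk to its peers.
        inbox = {vertex: [] for vertex in peers}
        for vertex, linked in peers.items():
            for peer in linked:
                inbox[peer].append(risk[vertex])

        # Compute phase: combine intrinsic risk with the received messages.
        new_risk = {}
        for vertex in peers:
            received = inbox[vertex]
            propagated = weight * sum(received) / len(received) if received else 0.0
            new_risk[vertex] = intrinsic[vertex] + propagated

        # Converged when no vertex changed by more than the tolerance.
        delta = max(abs(new_risk[v] - risk[v]) for v in peers)
        risk = new_risk
        if delta < tol:
            return risk, superstep
    return risk, max_supersteps


# Hypothetical two-customer graph linked by transactions.
peers = {"cus1": ["cus2"], "cus2": ["cus1"]}
intrinsic = {"cus1": 0.8, "cus2": 0.0}
risk, supersteps = propagate_risk(peers, intrinsic)
```

With a weight below 1 each superstep shrinks the change, so the loop stops well before `max_supersteps`; cus2 ends up with non-zero risk purely because of its link to cus1.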

IP3
IP2
GRAPH PROCESSING
Account 1
IP1 IP2
Account 2
IP3 IP1

LESSONS LEARNED

• Giraph is an emerging technology
− Incubation in 2012
− Rapidly evolving
− 0.1 and 0.2 are not compatible
− Lack of knowledge & doc
• Build internal git repo
• Read code and join mailing list
• Port code from 0.1 to 0.2
• Use Giraph 1.0 released on May 6 2013
GIRAPH

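• Must guarantee a minimum number of mappers (Giraph runs as a map-only job whose workers must all run concurrently)
• Capacity scheduler
− Set the queue's minimum mapper count above what the Giraph job needs
• Fair scheduler
− Set the queue's minimum mapper count above what the Giraph job needs
− Turn on pre-emption
− Set pre-emption wait time to a small interval – 20 sec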
HADOOP ENVIRONMENT INTEGRATION

• Memory constraint in a shared Hadoop environment
− 1.2B edges and 300M nodes
− Single purpose POC cluster mapper memory = 10 GB
− Shared R&D cluster mapper memory = 3 GB
• Reducing memory consumption is key
− Convert String to long for graph processing
− Convert back to String in post-processing for downstream application
− Cap the number of messages passed, by distance from the current vertex and by message payload data values
MEMORY SCALABILITY
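
The memory-saving ideas above can be sketched as follows. This is a single-process illustration, not the production approach: at the graph sizes quoted on this slide the ID mapping would itself need a distributed job. The helper names, the `max_hops` and `min_payload` thresholds, and the sample IDs are hypothetical.

```python
def build_id_maps(string_ids):
    """Assign each distinct string ID a compact integer ID."""
    to_long = {}
    to_string = []
    for string_id in string_ids:
        if string_id not in to_long:
            to_long[string_id] = len(to_string)
            to_string.append(string_id)
    return to_long, to_string


def should_forward(hops, payload, max_hops=2, min_payload=0.01):
    """Cap messages: stop after max_hops or when the payload is negligible."""
    return hops < max_hops and abs(payload) >= min_payload


# Hypothetical edges keyed by string account IDs.
edges = [("acct-abc", "acct-xyz"), ("acct-xyz", "acct-123")]

# Pre-processing: replace string IDs with integers for graph processing.
to_long, to_string = build_id_maps(v for edge in edges for v in edge)
numeric_edges = [(to_long[a], to_long[b]) for a, b in edges]

# Post-processing: restore string IDs for downstream applications.
restored = [(to_string[a], to_string[b]) for a, b in numeric_edges]
```

Fixed-width numeric IDs avoid per-vertex string overhead in memory, and the message cap bounds how far and how long small values keep circulating.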

• Giraph-based data engine to produce enriched data set
• Leverage Giraph on YARN
• Scalability with the number of workers
FUTURE WORK
