Hadoop Graph Processing with Apache Giraph

PayPal provides an online money transfer network. Each payment flow connects senders and receivers into a giant network in which each sender or receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed from the characteristics of the sender, receiver, and transaction involved. In this talk, we describe a novel network inference approach, built on Apache Giraph, that calculates a transaction risk score which also includes the risk profiles of neighboring senders and receivers. The approach reveals risk insights that the traditional method cannot. We leverage Hadoop to support graph computation over hundreds of millions of nodes and edges.
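
The neighbor-aware scoring described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Giraph code and not PayPal's actual model: the function name `propagate_risk`, the `damping` weight on peer risk, the use of the mean of incoming messages, and the convergence tolerance are all assumptions made for illustration. Each loop iteration plays the role of one superstep: every vertex sends its current risk to its neighbors, then recomputes its own score from its intrinsic risk plus what it received.

```python
from collections import defaultdict


def propagate_risk(intrinsic, edges, damping=0.5, tol=1e-6, max_supersteps=50):
    """Simulate vertex-centric risk propagation (illustrative only).

    intrinsic: dict mapping vertex -> intrinsic risk score
    edges: iterable of (sender, receiver) pairs, treated as undirected
    damping: assumed weight (< 1) applied to the risk received from peers
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Vertices that appear only in edges get an intrinsic risk of 0.
    vertices = set(intrinsic) | set(neighbors)
    base = {v: intrinsic.get(v, 0.0) for v in vertices}
    risk = dict(base)

    for _ in range(max_supersteps):
        # Send phase: each vertex sends its current risk to every neighbor.
        inbox = defaultdict(list)
        for v in vertices:
            for n in neighbors[v]:
                inbox[n].append(risk[v])

        # Compute phase: intrinsic risk plus damped mean of received risk.
        new_risk = {}
        for v in vertices:
            msgs = inbox[v]
            peer = sum(msgs) / len(msgs) if msgs else 0.0
            new_risk[v] = base[v] + damping * peer

        # Stop once no score moved by more than the tolerance.
        delta = max(abs(new_risk[v] - risk[v]) for v in vertices)
        risk = new_risk
        if delta < tol:
            break
    return risk


scores = propagate_risk(
    {"A": 0.9, "B": 0.1, "C": 0.1, "D": 0.1},
    [("A", "B"), ("C", "D")],
)
```

In this toy graph, B and C have the same intrinsic risk, but B transacts with the high-risk account A and so ends with a noticeably higher score than C. The scores are unnormalized; a real model would likely rescale them and weight edges by transaction attributes.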

Published in: Technology, Economy & Finance

  1. GRAPH MINING WITH APACHE GIRAPH: Jay Tang, June 2013
  2. AGENDA (footer: Confidential and Proprietary): Introduction • Big Data problem • Graph mining platform • Use case • Lessons • Future work
  3. ABOUT ME: Director of Big Data Platform & Analytics, PayPal − Hadoop, graph mining, real-time analytics, ML, text mining • 20 years of software experience in the Valley focused on data • Member of original Hadoop team @Yahoo • Built data warehouse, relational database, OLAP products @Yahoo, Oracle/Hyperion, IBM Informix, DB2
  4. BIG DATA PROBLEM
  5. BIG DATA PROBLEM @ PAYPAL: Enable online, offline, and mobile payment • 128M customers worldwide • $160B payment volume processed annually • Major retail locations accepting PayPal: 20K today → 2M by end of 2013 • PayPal Here launching in US and international markets • Petabyte data problem & growing
  6. BIG DATA POWERS PAYPAL ANALYTICS: Detect and prevent fraud • Assess credit risk • Relevant offers to our customers • Improve user experience • Provide better insights to our merchants
  7. GRAPH MINING PLATFORM
  8. BIG DATA STACK: Data Cloud
  9. DATA ABSTRACTION: Traditional data processing abstraction is the TABLE • Rows • Columns • Data types
  10. GRAPH IS EVERYWHERE: Internet & WWW • Social network • PayPal payment network (accounts & transactions)
  11. GRAPH COMPUTING: Think like a vertex • Two basic operations − Fusion: aggregate information from neighbors to a set of entities − Diffusion: propagate information from a vertex to neighbors
  12. THINK LIKE A VERTEX - FUSION
  13. THINK LIKE A VERTEX - DIFFUSION
  14. GRAPH MINING ENGINE: Which graph mining engine to use? − GraphLab − Apache Giraph − Apache Hama • Hadoop compatible − Data is on Hadoop − Leverage existing cluster infrastructure − Integration with Hadoop • Ease of deployment and update • Community
  15. BSP & GIRAPH: Apache open-source implementation of Google Pregel on Hadoop • Send messages from a vertex to any other vertex • In-memory scalable system − Map-only jobs, ZooKeeper, Netty
  16. GRAPH MINING USE CASE
  17. RISK DETECTION & MITIGATION: Stop fraudsters from stealing money from the PayPal payment network • Sophisticated risk models running in real time based on − Online data − Offline data • Risk profile traditionally based on a variety of data − Account − Transaction: frequency, amount, history − IP − Email domain
  18. RISK COMPUTATION (diagram labels): Current TX Details • History Data • Risk Models • Approve • Decline
  19. GRAPH MINING CONNECTED DATA: PayPal data are connected • Form multiple communities that have hidden inferences • Discover the inferences via a graph approach • Build a system to extract the inferences
  20. GRAPH VIEW OF DATA (diagram labels): User1 • User2 • Merchant • BUY • BUY • P2P Money Transfer
  21. GRAPH VIEW OF DATA (diagram labels): Account 1 • IP1 • IP2 • Account 2 • IP3
  22. GRAPH MINING DATA PIPELINE: Pre-processing → Graph processing → Post-processing
  23. GRAPH DATA PIPELINE: Input data is raw transaction data • Custom MapReduce jobs pre-process data into a graph model • Output is an adjacency list in JSON format − Easy to consume in Java and by humans − Uses the gson library • Post-processing: output format conversion
  24. GRAPH PROCESSING: Customers/accounts linked via transactions • Compute risk = intrinsic risk + risk propagated from peers • Send risk message to peers • Iterate until convergence • Diagram labels: Cus1, Cus2, Transaction T0, Transaction T1, Transaction T2, Transaction T3
  25. GRAPH PROCESSING (diagram labels): IP3 • IP2 • Account 1 • IP1 • IP2 • Account 2 • IP3 • IP1
  26. LESSONS LEARNED
  27. GIRAPH: Giraph is an emerging technology − Incubation in 2012 − Rapidly evolving − 0.1 and 0.2 are not compatible − Lack of knowledge & documentation • Build internal git repo • Read code and join mailing list • Port code from 0.1 to 0.2 • Use Giraph 1.0, released on May 6, 2013
  28. HADOOP ENVIRONMENT INTEGRATION: Must guarantee minimum number of mappers • Capacity scheduler − Set MIN mappers of queue > Giraph job needs • Fair scheduler − Set MIN mappers of queue > Giraph job needs − Turn on pre-emption − Set pre-emption wait time to a small interval (20 sec)
  29. MEMORY SCALABILITY: Memory constraint in a shared Hadoop environment − 1.2B edges and 300M nodes − Single-purpose POC cluster: mapper memory = 10 GB − Shared R&D cluster: mapper memory = 3 GB • Reducing memory consumption is key − Convert String to long for graph processing − Convert back to String in post-processing for downstream applications − Cap the number of messages passed − Distance from current vertex − Message payload data values
  30. FUTURE WORK: Giraph-based data engine to produce enriched data set • Leverage Giraph on YARN • Scalability in number of workers
  31. Q&A • WE ARE HIRING
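
The pre-processing step on slide 23 and the String-to-long conversion on slide 29 can be sketched together. The slides do not show the actual record layout, so everything here is an assumption for illustration: the function name `build_adjacency_lines`, the `(sender, receiver, amount)` input tuples, and the `[vertex_id, [[neighbor_id, total_amount], ...]]` line format. The talk's pipeline ran as MapReduce jobs in Java with gson; this single-process Python version only shows the shape of the transformation.

```python
import json
from collections import defaultdict


def build_adjacency_lines(transactions):
    """Turn (sender, receiver, amount) records into one JSON line per vertex.

    Returns (lines, id_to_name). String account names are replaced by
    compact integer IDs for graph processing; id_to_name lets a
    post-processing step convert them back for downstream consumers.
    """
    name_to_id = {}

    def vertex_id(name):
        # Assign integer IDs in order of first appearance.
        if name not in name_to_id:
            name_to_id[name] = len(name_to_id)
        return name_to_id[name]

    # adjacency[sender_id][receiver_id] = total amount sent
    adjacency = defaultdict(lambda: defaultdict(float))
    for sender, receiver, amount in transactions:
        s, r = vertex_id(sender), vertex_id(receiver)
        adjacency[s][r] += amount
        adjacency[r]  # touch the receiver so it gets a line even with no out-edges

    lines = [
        json.dumps([v, [[n, w] for n, w in sorted(adjacency[v].items())]])
        for v in sorted(adjacency)
    ]
    id_to_name = {i: name for name, i in name_to_id.items()}
    return lines, id_to_name


lines, id_to_name = build_adjacency_lines([
    ("alice@example.com", "shop-42", 30.0),
    ("bob@example.com", "shop-42", 12.5),
    ("alice@example.com", "bob@example.com", 5.0),
])
for line in lines:
    print(line)
```

On a cluster, the ID assignment cannot be a single in-memory dictionary; it would need its own job or a shared lookup table. The point of the sketch is only that the graph engine sees fixed-width integers while the string identifiers are restored afterwards.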