GRAPH MINING WITH APACHE GIRAPH
Jay Tang
June 2013
Confidential and Proprietary
AGENDA
• Introduction
• Big Data problem
• Graph mining platform
• Use case
• Lessons
• Future work
ABOUT ME
• Director of Big Data Platform & Analytics, PayPal
− Hadoop, graph mining, real-time analytics, ML, text mining
• 20 years of software experience in Silicon Valley, focused on data
• Member of the original Hadoop team at Yahoo
• Built data warehouse, relational database, and OLAP products at Yahoo, Oracle/Hyperion, and IBM (Informix, DB2)
BIG DATA PROBLEM
BIG DATA PROBLEM @ PAYPAL
• Enable online, offline, and mobile payments
• 128M customers worldwide
• $160B payment volume processed annually
• Major retail locations accepting PayPal: 20K today → 2M by end of 2013
• PayPal Here launching in US and international markets
Petabyte Data Problem & Growing
BIG DATA POWERS PAYPAL ANALYTICS
• Detect and prevent fraud
• Assess credit risk
• Present relevant offers to our customers
• Improve user experience
• Provide better insights to our merchants
GRAPH MINING PLATFORM
BIG DATA STACK
[Figure: big data stack diagram; surviving labels: Data, Cloud]
DATA ABSTRACTION
Traditional data processing abstraction: the TABLE
• Rows
• Columns
• Data types
GRAPHS ARE EVERYWHERE
• Internet & WWW
• Social networks
• PayPal payment network – accounts & transactions
GRAPH COMPUTING
• Think like a vertex
• Two basic operations
− Fusion: aggregate information from neighbors into a set of entities
− Diffusion: propagate information from a vertex to its neighbors
THINK LIKE A VERTEX - FUSION
THINK LIKE A VERTEX - DIFFUSION
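A minimal sketch of the two operations, assuming plain Java maps stand in for the graph. The class and method names (VertexOps, fuse, diffuse) are hypothetical and are not part of the Giraph API; summation is used as the aggregation only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of fusion and diffusion on an adjacency map (not Giraph code). */
final class VertexOps {
    /** Fusion: a vertex aggregates (here: sums) the values held by its neighbors. */
    static double fuse(String vertex,
                       Map<String, List<String>> neighbors,
                       Map<String, Double> values) {
        double sum = 0.0;
        for (String n : neighbors.getOrDefault(vertex, List.of())) {
            sum += values.getOrDefault(n, 0.0);
        }
        return sum;
    }

    /** Diffusion: a vertex pushes its own value into the inbox of each neighbor. */
    static Map<String, Double> diffuse(String vertex,
                                       Map<String, List<String>> neighbors,
                                       Map<String, Double> values) {
        Map<String, Double> inbox = new HashMap<>();
        double own = values.getOrDefault(vertex, 0.0);
        for (String n : neighbors.getOrDefault(vertex, List.of())) {
            inbox.merge(n, own, Double::sum);
        }
        return inbox;
    }
}
```

Fusion reads from the neighborhood into one vertex; diffusion writes from one vertex out to the neighborhood. In a vertex-centric engine both are usually expressed through messages rather than direct reads.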
GRAPH MINING ENGINE
• Which graph mining engine to use?
− GraphLab
− Apache Giraph
− Apache Hama
• Hadoop compatible
− Data is on Hadoop
− Leverage existing cluster infrastructure
− Integration with Hadoop
• Ease of deployment and update
• Community
BSP & GIRAPH
• Apache open-source implementation of Google Pregel on Hadoop
• Send messages from a vertex to any other vertex
• In-memory, scalable system
− Map-only jobs, ZooKeeper, Netty
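The superstep model behind Pregel-style systems can be sketched in a single process, assuming the classic "propagate the maximum value" example. This is an illustration only: BspMaxValue and its method are hypothetical names, and nothing here uses Giraph classes, ZooKeeper, or Netty.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-process simulation of BSP supersteps (hypothetical, not Giraph API). */
final class BspMaxValue {
    /**
     * Spreads the largest value along outgoing edges until no messages remain.
     * Updates 'values' in place and returns the number of supersteps executed.
     */
    static int run(Map<Integer, List<Integer>> outEdges, Map<Integer, Integer> values) {
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        int superstep = 0;
        while (true) {
            // Messages sent now are only visible in the next superstep.
            Map<Integer, List<Integer>> nextInbox = new HashMap<>();
            for (Integer v : values.keySet()) {
                int current = values.get(v);
                int best = current;
                for (int m : inbox.getOrDefault(v, List.of())) {
                    best = Math.max(best, m);
                }
                boolean changed = best > current;
                if (changed) {
                    values.put(v, best);
                }
                // Superstep 0: every vertex broadcasts; later: only changed vertices do.
                if (superstep == 0 || changed) {
                    for (Integer target : outEdges.getOrDefault(v, List.of())) {
                        nextInbox.computeIfAbsent(target, k -> new ArrayList<>()).add(best);
                    }
                }
            }
            superstep++;
            if (nextInbox.isEmpty()) {
                return superstep; // no messages in flight: all vertices have halted
            }
            inbox = nextInbox;
        }
    }
}
```

The barrier between supersteps is modeled by swapping inbox for nextInbox; in a distributed run that barrier is where workers synchronize.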
GRAPH MINING USE CASE
RISK DETECTION & MITIGATION
• Stop fraudsters from stealing money from the PayPal payment network
• Sophisticated risk models running in real time, based on
− Online data
− Offline data
• Risk profile traditionally based on a variety of data
− Account
− Transaction: frequency, amount, history
− IP
− Email domain
RISK COMPUTATION
[Figure labels: Current TX Details, History Data, Risk Models, Approve, Decline]
GRAPH MINING CONNECTED DATA
• PayPal data is connected
• The data forms multiple communities that carry hidden inferences
• Discover the inferences via a graph approach
• Build a system to extract the inferences
GRAPH VIEW OF DATA
[Figure labels: User1, User2, Merchant; edge labels: BUY, BUY, P2P Money Transfer]
GRAPH VIEW OF DATA
[Figure labels: Account 1, Account 2, IP1, IP2, IP3]
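One way to read the account/IP view is as a bipartite graph in which two accounts become linked when they share an IP. The sketch below assumes that reading; the class name SharedIpLinks and the sample identifiers in the usage note are hypothetical, and the edges in the original figure are not recoverable.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Links accounts through the IP addresses they have in common (illustrative only). */
final class SharedIpLinks {
    /** Inverts account -> IPs into IP -> accounts, keeping only IPs seen on two or more accounts. */
    static Map<String, Set<String>> sharedIps(Map<String, Set<String>> ipsByAccount) {
        Map<String, Set<String>> accountsByIp = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : ipsByAccount.entrySet()) {
            for (String ip : entry.getValue()) {
                accountsByIp.computeIfAbsent(ip, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        // An IP used by a single account links nothing, so drop it.
        accountsByIp.values().removeIf(accounts -> accounts.size() < 2);
        return accountsByIp;
    }
}
```

For example, if account1 used ip1 and ip2 while account2 used ip2 and ip3, the result holds a single entry mapping ip2 to both accounts.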
GRAPH MINING DATA PIPELINE
Pre-Processing → Graph Processing → Post-Processing
GRAPH DATA PIPELINE
• Input data is raw transaction data
• Custom MapReduce jobs pre-process the data into a graph model
• Output is a JSON-formatted adjacency list
− Easy to consume in Java and by humans
− Uses the gson library
• Post-processing – output format conversion
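The deck does not show the exact JSON layout. As an assumption, the sketch below emits one vertex per line as [id, value, [[targetId, edgeWeight], ...]], which resembles the layout read by Giraph's bundled JsonLongDoubleFloatDoubleVertexInputFormat. The string is built by hand to keep the example dependency-free, whereas the deck's pipeline used gson; AdjacencyJson is a hypothetical name.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Renders a vertex and its outgoing edges as one JSON adjacency-list line (assumed layout). */
final class AdjacencyJson {
    /** Produces [id,value,[[targetId,edgeWeight],...]] for a single vertex. */
    static String toLine(long id, double value, Map<Long, Float> edges) {
        StringJoiner targets = new StringJoiner(",", "[", "]");
        for (Map.Entry<Long, Float> edge : edges.entrySet()) {
            targets.add("[" + edge.getKey() + "," + edge.getValue() + "]");
        }
        return "[" + id + "," + value + "," + targets + "]";
    }
}
```

One line per vertex keeps the file splittable across mappers and still readable when inspecting output by hand.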
GRAPH PROCESSING
• Customers/accounts linked via transactions
• Compute risk = intrinsic risk + risk propagated from peers
• Send risk message to peers
• Iterate until convergence
[Figure labels: Cus1, Cus2, Transaction T0, Transaction T1, Transaction T2, Transaction T3]
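The iteration can be sketched as follows. The deck gives only the form risk = intrinsic risk + risk propagated from peers; the damped average of peer risk, the tolerance, and the iteration cap used here are assumptions added so that the loop converges, and RiskPropagation is a hypothetical name rather than PayPal's model.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Iterative risk propagation over a peer graph (illustrative model, assumed damping). */
final class RiskPropagation {
    static Map<String, Double> propagate(Map<String, List<String>> peers,
                                         Map<String, Double> intrinsic,
                                         double damping,
                                         double tolerance,
                                         int maxIterations) {
        Map<String, Double> risk = new HashMap<>(intrinsic);
        for (int i = 0; i < maxIterations; i++) {
            Map<String, Double> next = new HashMap<>();
            double maxDelta = 0.0;
            for (String v : intrinsic.keySet()) {
                List<String> ps = peers.getOrDefault(v, List.of());
                // "Messages" from peers: their risk as of the previous iteration.
                double incoming = 0.0;
                for (String p : ps) {
                    incoming += risk.getOrDefault(p, 0.0);
                }
                double propagated = ps.isEmpty() ? 0.0 : damping * incoming / ps.size();
                double updated = intrinsic.get(v) + propagated;
                maxDelta = Math.max(maxDelta, Math.abs(updated - risk.get(v)));
                next.put(v, updated);
            }
            risk = next;
            if (maxDelta < tolerance) {
                break; // converged: no vertex moved by more than the tolerance
            }
        }
        return risk;
    }
}
```

With a damping factor below 1 each round shrinks the remaining change, so the loop settles; without some such factor, mutual peers could push each other's risk up without bound.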
GRAPH PROCESSING
[Figure labels: Account 1, Account 2, IP1, IP2, IP3]
LESSONS LEARNED
GIRAPH
• Giraph is an emerging technology
− Incubation in 2012
− Rapidly evolving
− 0.1 and 0.2 are not compatible
− Lack of knowledge & documentation
• Build an internal git repo
• Read the code and join the mailing list
• Port code from 0.1 to 0.2
• Use Giraph 1.0, released on May 6, 2013
HADOOP ENVIRONMENT INTEGRATION
• Must guarantee a minimum number of mappers
• Capacity scheduler
− Set the queue's minimum mappers above what the Giraph job needs
• Fair scheduler
− Set the queue's minimum mappers above what the Giraph job needs
− Turn on pre-emption
− Set the pre-emption wait time to a small interval – 20 sec
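The reason for the guarantee is that a BSP job needs all of its map tasks running at the same time; a job that gets only some of its slots waits instead of making progress. A small sketch of the sizing check, assuming the usual Giraph-on-MapReduce layout of one map task per worker plus one for the master. MapperBudget and its methods are hypothetical helpers, not scheduler or Giraph API.

```java
/** Sizing check for a queue that must host a whole Giraph job at once (illustrative). */
final class MapperBudget {
    /** Map tasks needed concurrently: one per worker plus one master (assumed split master/worker). */
    static int requiredMappers(int workers) {
        return workers + 1;
    }

    /** Follows the slide: the queue's guaranteed minimum must exceed what the job needs. */
    static boolean queueCanHost(int queueMinMappers, int workers) {
        return queueMinMappers > requiredMappers(workers);
    }
}
```

For a job with 10 workers this asks for a queue minimum of at least 12 mappers.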
MEMORY SCALABILITY
• Memory constraint in a shared Hadoop environment
− 1.2B edges and 300M nodes
− Single-purpose POC cluster: mapper memory = 10 GB
− Shared R&D cluster: mapper memory = 3 GB
• Reducing memory consumption is key
− Convert String to long for graph processing
− Convert back to String in post-processing for downstream applications
− Cap the number of messages passed, based on distance from the current vertex and message payload data values
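The String-to-long conversion can be sketched as a dictionary that hands out compact numeric ids and maps them back afterwards. This in-memory version only illustrates the idea; at the graph sizes quoted above the mapping would presumably be built in the pre-processing jobs instead. IdDictionary is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Maps String identifiers to dense long codes and back (in-memory illustration). */
final class IdDictionary {
    private final Map<String, Long> codeById = new HashMap<>();
    private final List<String> idByCode = new ArrayList<>();

    /** Returns the existing code for an id, or assigns the next free one. */
    long encode(String id) {
        Long existing = codeById.get(id);
        if (existing != null) {
            return existing;
        }
        long assigned = idByCode.size();
        codeById.put(id, assigned);
        idByCode.add(id);
        return assigned;
    }

    /** Restores the original String for post-processing output. */
    String decode(long code) {
        // The list index is an int, so this toy version is limited to about 2 billion ids.
        return idByCode.get((int) code);
    }
}
```

A long occupies 8 bytes, while a Java String carries object overhead on top of its characters, which is why vertex ids and message payloads shrink noticeably after encoding.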
FUTURE WORK
• Giraph-based data engine to produce enriched data sets
• Leverage Giraph on YARN
• Scalability in the number of workers
Q&A
WE ARE HIRING

Hadoop Graph Processing with Apache Giraph

Editor's Notes

  1. Input data size, giraph data size