SlideShare a Scribd company logo
Graph Based Representation Learning
To Prevent Payment Collusion Fraud
Agenda Problem
Approach
Experiments
Conclusion
©2017 PayPal Inc. Confidential and proprietary.
PROBLEM
Fraud Prevention @ PayPal
Robust feature engineering, machine
learning and statistical models
Highly scalable and multi-layered
infrastructure software
Superior team of data scientists,
researchers, financial and intelligence
analysts
Images source:
Collusion Fraud – An Example Scenario
Buyer purchases an item using PayPal
Seller ships item to buyer
Buyer finds box empty & asks PayPal for
refund
Images source:
PayPal can refund buyer (eligible transactions under Buyer Protection)
Collusion Fraud – An Example Scenario
PayPal incur loss
Seller gives proof
Buyer and seller split the money..
Images source:
Collusion Fraud
PayPal asks seller for
proof
Repays seller (Seller Protection)
© 2017 PayPal Inc. Confidential and proprietary.
Collusion Fraud
• What if there are many such buyers colluding with many such sellers?
• What if such buyers & sellers stay below the radar? ($Transaction <
Threshold)?
• What if such sellers behave well with majority of the buyers & collude
with a few?
• How do we detect if the buyers and sellers are legitimate or if they are
colluding with each other?
7
© 2017 PayPal Inc. Confidential and proprietary.
Collusion Fraud
8
• Can we exploit network structure of
fraudsters to solve collusion fraud?
Buyer
Buyer
Buyer
Buyer
IP1
IP2
IP3
Seller1
Seller1
Seller1
Ships empty box
Ships good
Logs in
Purchase item
APPROACH
© 2017 PayPal Inc. Confidential and proprietary.
Traditional ML vs End-to-End Learning
• Traditional ML –
Significant human effort
in engineering features &
labelling
• End-to-End Learning –
Use algorithm (such as
Deep learning) to learn
feature representations
automatically
• need tabular data
10
Raw
Data
Expert Driven
Feature Engineering
Features Algorithm
Models
Raw
Data
Feature Learning (Representation Learning)
Traditional ML
End-to-End Learning
Images source:
© 2017 PayPal Inc. Confidential and proprietary.
Approach
• Learn features from graphs & use Driverless AI to engineer additional
features and build model.
• Feature learning on graphs is fairly new research area & its full potential
has not been realized.
• Graph based feature learning helps to understand fraud network & thus
prevent collusion and other organized crimes
11
Models
Raw
Data
Representation Learning
(Automatic Feature)
Driverless AI
(Feature Engineering +
Model Training
Images source:
Expert Engineered
Feature
Feature
Tabular Data
© 2017 PayPal Inc. Confidential and proprietary.
Approach – Graph Based Representation Learning
• Idea
• You shall know a word (node) by the company (neigbhor) it keep(s)* (Firth, J. R. 1957)
• Word2Vec* - Continuous feature representation for words
• Suppose user searches for “hotel”, we want to also match “motel”
• One hot representation (discrete)
• Build a dense vector to predict other words from context
• Two algorithms – Skip Gram (SG) (Predict context words from target) &
Continuous Bag of Words (CBOW) (Predict target from context)
• Graph Based Representation Learning
• Learn continuous feature representation for nodes
• Representation incorporates community a node belong to & role they play
12*source: prof. manning’s lecture notes
(http://web.stanford.edu/class/cs224n/)
© 2017 PayPal Inc. Confidential and proprietary.
Algorithm – node2vec*Grover & Leskovec, 2016
13
*source: https://arxiv.org/pdf/1607.00653.pdf
• Node2Vec
• s1, s2, s3, s4, u - same
community
• u & s6 also play the role of hub
• “neigborhood preserving” graph
based objective function
optimized using SGD
• MLE optimization problem
• BFS & DFS sampling to
generate neigbhors
© 2017 PayPal Inc. Confidential and proprietary.
Implementation Framework
14
Raw Events
(HDFS)
Human Expert
Node
Representation
Feature vector
node2vec
Graph DB
Expert
Engineered
Feature vector
Driverless AI
(Feature Engineering +
Model Training)
EXPERIMENTS
© 2017 PayPal Inc. Confidential and proprietary.
Datasets
16
• Training Data
• Subset of 1 year transactions
• 1.5 billion edges & 0.5 million nodes
• Test Data
• 3 months
• # of features
• 400 - 600
© 2017 PayPal Inc. Confidential and proprietary.
Environment &Tools
17
• Node2Vec – Node representation learning
• Driverless AI – Feature Engineering & ModelTraining
• Spark – Data Preparation/Pre-processing
• Hardware – GPU server
• 4 Pascal 100 GPU
• 160 cores CPUs
• 1 TB RAM
© 2017 PayPal Inc. Confidential and proprietary.
Experiment
18
• Training time (subset of data) – Driverless AI on GPU 6x faster
• laptop (accuracy 1) - ~ 2 hours
• GPU (accuracy 1) – 21 minutes; (accuracy 5) – 58 minutes
© 2017 PayPal Inc. Confidential and proprietary. 19
Experiment
• Top 5 variables from DAI
• AUC – 0.9477
CONCLUSIONS
© 2017 PayPal Inc. Confidential and proprietary.
Conclusions
21
• Graph based representation learning yield robust feature set for complex
fraud patterns such as collusion fraud
• Driverless AI not only help to engineer additional features automatically
but also significantly improve model training time (under 2 hours).
• Next steps
• Evaluate Driverless AI results on out-of-time data sets.
• Evaluate Driverless AI directly on raw data
• Evaluate representation learning on edges and weighted graphs
• Machine learning on graphs

More Related Content

What's hot

XGBoost & LightGBM
XGBoost & LightGBMXGBoost & LightGBM
XGBoost & LightGBM
Gabriel Cypriano Saca
 
UX: internal search for e-commerce
UX: internal search for e-commerceUX: internal search for e-commerce
UX: internal search for e-commerce
Myriam Jessier
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
Jeong-Yoon Lee
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Databricks
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
Sri Ambati
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman
 
Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
Databricks
 
Basics of Deep learning
Basics of Deep learningBasics of Deep learning
Basics of Deep learning
Ramesh Kumar
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
Dong Heon Cho
 
[PAP] 실무자를 위한 인과추론 활용 : Best Practices
[PAP] 실무자를 위한 인과추론 활용 : Best Practices[PAP] 실무자를 위한 인과추론 활용 : Best Practices
[PAP] 실무자를 위한 인과추론 활용 : Best Practices
Bokyung Choi
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Databricks
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
Ted Xiao
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
Olivier Jeunen
 
AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...
AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...
AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...
EtoutsSeo
 
Text Classification
Text ClassificationText Classification
Text Classification
RAX Automation Suite
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
NLP.pptx
NLP.pptxNLP.pptx
NLP.pptx
Rahul Borate
 
Reinforcement learning slides
Reinforcement learning slidesReinforcement learning slides
Reinforcement learning slides
OmranHakami
 
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Edureka!
 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
Alice Zheng
 

What's hot (20)

XGBoost & LightGBM
XGBoost & LightGBMXGBoost & LightGBM
XGBoost & LightGBM
 
UX: internal search for e-commerce
UX: internal search for e-commerceUX: internal search for e-commerce
UX: internal search for e-commerce
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
 
Basics of Deep learning
Basics of Deep learningBasics of Deep learning
Basics of Deep learning
 
Basic Generative Adversarial Networks
Basic Generative Adversarial NetworksBasic Generative Adversarial Networks
Basic Generative Adversarial Networks
 
[PAP] 실무자를 위한 인과추론 활용 : Best Practices
[PAP] 실무자를 위한 인과추론 활용 : Best Practices[PAP] 실무자를 위한 인과추론 활용 : Best Practices
[PAP] 실무자를 위한 인과추론 활용 : Best Practices
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
 
AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...
AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...
AI and Machine Learning Revolution_ Transforming Digital Marketing for Succes...
 
Text Classification
Text ClassificationText Classification
Text Classification
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
NLP.pptx
NLP.pptxNLP.pptx
NLP.pptx
 
Reinforcement learning slides
Reinforcement learning slidesReinforcement learning slides
Reinforcement learning slides
 
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
 

Similar to Graph representation learning to prevent payment collusion fraud

Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Sri Ambati
 
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
Amazon Web Services
 
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
Amazon Web Services
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
Max De Marzi
 
OC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBMOC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBM
Big Data Joe™ Rossi
 
SD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBMSD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBM
Big Data Joe™ Rossi
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Neo4j
 
Hadoop summit 2017 enterprise graph analytics
Hadoop summit 2017 enterprise graph analyticsHadoop summit 2017 enterprise graph analytics
Hadoop summit 2017 enterprise graph analytics
Jun(Terry) Yang
 
AI & AWS DeepComposer
AI & AWS DeepComposerAI & AWS DeepComposer
AI & AWS DeepComposer
Amazon Web Services
 
Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...
DataWorks Summit
 
Deep learning systems model serving
Deep learning systems   model servingDeep learning systems   model serving
Deep learning systems model serving
Hagay Lupesko
 
Hadoop Summit 2017 Enterprise Graph Analytics
Hadoop Summit 2017 Enterprise Graph AnalyticsHadoop Summit 2017 Enterprise Graph Analytics
Hadoop Summit 2017 Enterprise Graph Analytics
Jing Chen (Jerry) He
 
Artificial Intelligence (ML - DL)
Artificial Intelligence (ML - DL)Artificial Intelligence (ML - DL)
Artificial Intelligence (ML - DL)
ShehryarSH1
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Synerzip
 
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Neo4j
 
Crawlable Spatial Data - #Geo4Web research topic #3
Crawlable Spatial Data - #Geo4Web research topic #3Crawlable Spatial Data - #Geo4Web research topic #3
Crawlable Spatial Data - #Geo4Web research topic #3
Dimitri van Hees
 
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...
Amazon Web Services
 
Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...
Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...
Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...
Amazon Web Services
 
Machine Learning Projects @ commercetools
Machine Learning Projects @ commercetoolsMachine Learning Projects @ commercetools
Machine Learning Projects @ commercetools
Amadeus Magrabi
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
Neo4j
 

Similar to Graph representation learning to prevent payment collusion fraud (20)

Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
 
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
 
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
OC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBMOC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBM
 
SD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBMSD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBM
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
 
Hadoop summit 2017 enterprise graph analytics
Hadoop summit 2017 enterprise graph analyticsHadoop summit 2017 enterprise graph analytics
Hadoop summit 2017 enterprise graph analytics
 
AI & AWS DeepComposer
AI & AWS DeepComposerAI & AWS DeepComposer
AI & AWS DeepComposer
 
Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...
 
Deep learning systems model serving
Deep learning systems   model servingDeep learning systems   model serving
Deep learning systems model serving
 
Hadoop Summit 2017 Enterprise Graph Analytics
Hadoop Summit 2017 Enterprise Graph AnalyticsHadoop Summit 2017 Enterprise Graph Analytics
Hadoop Summit 2017 Enterprise Graph Analytics
 
Artificial Intelligence (ML - DL)
Artificial Intelligence (ML - DL)Artificial Intelligence (ML - DL)
Artificial Intelligence (ML - DL)
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
 
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
 
Crawlable Spatial Data - #Geo4Web research topic #3
Crawlable Spatial Data - #Geo4Web research topic #3Crawlable Spatial Data - #Geo4Web research topic #3
Crawlable Spatial Data - #Geo4Web research topic #3
 
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...
 
Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...
Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...
Workshop Build an Image-Based Automatic Alert System with Amazon Rekognition:...
 
Machine Learning Projects @ commercetools
Machine Learning Projects @ commercetoolsMachine Learning Projects @ commercetools
Machine Learning Projects @ commercetools
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 

Recently uploaded (20)

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 

Graph representation learning to prevent payment collusion fraud

  • 1. Graph Based Representation Learning To Prevent Payment Collusion Fraud
  • 4. Fraud Prevention @ PayPal Robust feature engineering, machine learning and statistical models Highly scalable and multi-layered infrastructure software Superior team of data scientists, researchers, financial and intelligence analysts Images source:
  • 5. Collusion Fraud – An Example Scenario Buyer purchases an item using PayPal Seller ships item to buyer Buyer finds box empty & asks PayPal for refund Images source: PayPal can refund buyer (eligible transactions under Buyer Protection)
  • 6. Collusion Fraud – An Example Scenario PayPal incur loss Seller gives proof Buyer and seller split the money.. Images source: Collusion Fraud PayPal asks seller for proof Repays seller (Seller Protection)
  • 7. © 2017 PayPal Inc. Confidential and proprietary. Collusion Fraud • What if there are many such buyers colluding with many such sellers? • What if such buyers & sellers stay below the radar? ($Transaction < Threshold)? • What if such sellers behave well with majority of the buyers & collude with a few? • How do we detect if the buyers and sellers are legitimate or if they are colluding with each other? 7
  • 8. © 2017 PayPal Inc. Confidential and proprietary. Collusion Fraud 8 • Can we exploit network structure of fraudsters to solve collusion fraud? Buyer Buyer Buyer Buyer IP1 IP2 IP3 Seller1 Seller1 Seller1 Ships empty box Ships good Logs in Purchase item
  • 10. © 2017 PayPal Inc. Confidential and proprietary. Traditional ML vs End-to-End Learning • Traditional ML – Significant human effort in engineering features & labelling • End-to-End Learning – Use algorithm (such as Deep learning) to learn feature representations automatically • need tabular data 10 Raw Data Expert Driven Feature Engineering Features Algorithm Models Raw Data Feature Learning (Representation Learning) Traditional ML End-to-End Learning Images source:
  • 11. © 2017 PayPal Inc. Confidential and proprietary. Approach • Learn features from graphs & use Driverless AI to engineer additional features and build model. • Feature learning on graphs is fairly new research area & its full potential has not been realized. • Graph based feature learning helps to understand fraud network & thus prevent collusion and other organized crimes 11 Models Raw Data Representation Learning (Automatic Feature) Driverless AI (Feature Engineering + Model Training Images source: Expert Engineered Feature Feature Tabular Data
  • 12. © 2017 PayPal Inc. Confidential and proprietary. Approach – Graph Based Representation Learning • Idea • You shall know a word (node) by the company (neigbhor) it keep(s)* (Firth, J. R. 1957) • Word2Vec* - Continuous feature representation for words • Suppose user searches for “hotel”, we want to also match “motel” • One hot representation (discrete) • Build a dense vector to predict other words from context • Two algorithms – Skip Gram (SG) (Predict context words from target) & Continuous Bag of Words (CBOW) (Predict target from context) • Graph Based Representation Learning • Learn continuous feature representation for nodes • Representation incorporates community a node belong to & role they play 12*source: prof. manning’s lecture notes (http://web.stanford.edu/class/cs224n/)
  • 13. © 2017 PayPal Inc. Confidential and proprietary. Algorithm – node2vec*Grover & Leskovec, 2016 13 *source: https://arxiv.org/pdf/1607.00653.pdf • Node2Vec • s1, s2, s3, s4, u - same community • u & s6 also play the role of hub • “neigborhood preserving” graph based objective function optimized using SGD • MLE optimization problem • BFS & DFS sampling to generate neigbhors
  • 14. © 2017 PayPal Inc. Confidential and proprietary. Implementation Framework 14 Raw Events (HDFS) Human Expert Node Representation Feature vector node2vec Graph DB Expert Engineered Feature vector Driverless AI (Feature Engineering + Model Training)
  • 16. © 2017 PayPal Inc. Confidential and proprietary. Datasets 16 • Training Data • Subset of 1 year transactions • 1.5 billion edges & 0.5 million nodes • Test Data • 3 months • # of features • 400 - 600
  • 17. © 2017 PayPal Inc. Confidential and proprietary. Environment &Tools 17 • Node2Vec – Node representation learning • Driverless AI – Feature Engineering & ModelTraining • Spark – Data Preparation/Pre-processing • Hardware – GPU server • 4 Pascal 100 GPU • 160 cores CPUs • 1 TB RAM
  • 18. © 2017 PayPal Inc. Confidential and proprietary. Experiment 18 • Training time (subset of data) – Driverless AI on GPU 6x faster • laptop (accuracy 1) - ~ 2 hours • GPU (accuracy 1) – 21 minutes; (accuracy 5) – 58 minutes
  • 19. © 2017 PayPal Inc. Confidential and proprietary. 19 Experiment • Top 5 variables from DAI • AUC – 0.9477
  • 21. © 2017 PayPal Inc. Confidential and proprietary. Conclusions 21 • Graph based representation learning yield robust feature set for complex fraud patterns such as collusion fraud • Driverless AI not only help to engineer additional features automatically but also significantly improve model training time (under 2 hours). • Next steps • Evaluate Driverless AI results on out-of-time data sets. • Evaluate Driverless AI directly on raw data • Evaluate representation learning on edges and weighted graphs • Machine learning on graphs