Real-time Fraud Detection at Scale - Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Benyue (Emma) Liu, TigerGraph Inc.
#UnifiedDataAnalytics #SparkAISummit
“Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Graph is HOW WE THINK
Common TigerGraph Use Cases
Increase Revenue: Analyze all interactions in real-time to sell more
• Recommendation Engine
• Real-time Customer 360/MDM
• Product & Service Marketing
Reduce Costs & Manage Risks: Reduce costs and assess and monitor risks effectively
• Fraud Detection
• Anti-Money Laundering (AML)
• Risk Assessment & Monitoring
• Cyber Security
Improve Operational Efficiency: Manage resources for maximum output
• Enterprise Knowledge Graph
• Network, IT and Cloud Resource Optimization
• Energy Management System
• Supply Chain Analysis
Foundational Use Cases: Geospatial Analysis, Time Series Analysis, AI and Machine Learning
7 Key Data Science Capabilities Powered By a Native Parallel Graph
● Deep Link Analysis: From a set of entities (e.g. devices, customers, accounts, doctors), show all links or connections
● Relational Commonality Discovery and Computation: Given 2 entities (e.g. customers, businesses), follow their relationships to find commonality
● Multi-dimensional Entity & Pattern Matching: Given a pattern (e.g. referring business to a relative), find similar patterns in the graph
● Hub & Community Detection: Find most influential members of a group (customers, doctors, citizens) & detect community around them
● Geospatial Graph Analysis: Analyze changes in entities & relationships with location data
● Temporal (Time-Series) Graph Analysis: Analyze changes in entities & relationships over time
● Machine Learning Feature Generation & Explainable AI: Extract graph-based features to feed as training data for machine learning; Power Explainable AI
Power Explainable AI with TigerGraph
Why Spark + TigerGraph?
Spark + TigerGraph Data Pipeline
Typical Spark + TigerGraph Integration
● Data Preparation and Integration (TigerGraph/Spark)
● Unsupervised Learning (TigerGraph)
● Feature Extraction for Supervised Learning (TigerGraph/Spark)
● Model Training (Spark)
● Validate and Apply Model (TigerGraph)
● Visualize and Explore Interconnected Data (TigerGraph)
Machine Learning with TigerGraph
China Mobile Anti-Fraud/Scam Detection
Real-Time Phone-Based Fraud Detection
Massive, Worldwide Problem
● 18 Billion robocalls in US in 2017 (hiya.com)
● Spam/Scam - agile, spoofed numbers
Customer:
● 600M subscribers
● 300M calls/day, peak 10K calls/sec
● Need: Real-time detection of various types of phone-based fraud
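For a sense of scale, the customer figures above imply an average call rate well below the stated peak. The following is a back-of-the-envelope check only; it assumes the daily total is averaged over a full 24-hour day, which the slide does not state.

```python
# Back-of-the-envelope load estimate from the figures on the slide.
CALLS_PER_DAY = 300_000_000      # from the slide
PEAK_CALLS_PER_SEC = 10_000      # from the slide
SECONDS_PER_DAY = 24 * 60 * 60   # assumption: traffic averaged over a full day

avg_calls_per_sec = CALLS_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_CALLS_PER_SEC / avg_calls_per_sec
# Time available per call at peak if calls were handled strictly one at a time
# (a real deployment would process calls in parallel).
peak_budget_ms = 1000 / PEAK_CALLS_PER_SEC

print(f"average: {avg_calls_per_sec:,.0f} calls/sec")
print(f"peak is {peak_to_avg:.2f}x the average")
print(f"serial budget at peak: {peak_budget_ms:.1f} ms per call")
```

Under that assumption the average is roughly 3,500 calls/sec, so the system has to absorb peaks of nearly three times the average.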
Real-Time Phone Anti-Spam/Scam Detection
TigerGraph Solution: Real-time graph-based machine learning and decision system
Graph Analytics
● Real-time machine learning
○ 118 graph features per call
○ Retrained periodically with 2M calls
● Real-time decisions
○ Call recipient sees alert if ML system says call is suspicious
● In production since Dec 2016
Graph Database
● 600M phone numbers (inside and outside network)
● 15B phone-phone call edges (2 month sliding window)
○ Time
○ Duration
● Real-time graph updates, peak 10K+ calls/sec
● 118 graph features per phone
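To make the "2 month sliding window" idea concrete, here is a toy in-memory sketch, not TigerGraph code. The class name `CallGraph`, the 60-day window length, and the assumption that calls arrive in timestamp order are all illustrative choices, not details from the talk.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60 * 24 * 60 * 60  # roughly two months; the exact length is an assumption


class CallGraph:
    """Toy stand-in for the phone-phone call graph (hypothetical, in memory)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        # (caller, callee) -> deque of (timestamp, duration), oldest first
        self.edges = defaultdict(deque)

    def add_call(self, caller, callee, timestamp, duration):
        """Record one call and drop calls on this edge that left the window."""
        calls = self.edges[(caller, callee)]
        calls.append((timestamp, duration))
        # Assumes calls arrive in timestamp order, so the oldest is at the left.
        while calls and calls[0][0] <= timestamp - self.window:
            calls.popleft()

    def edge_summary(self, caller, callee, now):
        """Return (number of calls, total duration) still inside the window."""
        calls = self.edges.get((caller, callee), ())
        live = [(t, d) for t, d in calls if t > now - self.window]
        return len(live), sum(d for _, d in live)


g = CallGraph(window_seconds=100)
g.add_call("A", "B", timestamp=0, duration=30)
g.add_call("A", "B", timestamp=50, duration=10)
g.add_call("A", "B", timestamp=120, duration=5)   # evicts the call made at t=0
print(g.edge_summary("A", "B", now=120))
```

Keeping per-edge aggregates such as call count and total duration is what lets features be read off an edge without rescanning raw call records.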
Examples of Graph Features for Machine Learning
Good Phone Features
(1) High call back phone
(2) Stable group
(3) Long term phone
(4) Many in-group connections
(5) 3-step friend relation
Bad Phone Features
(1) Short term call duration
(2) Empty stable group
(3) No call back phone
(4) Many rejected calls
(5) Average distance > 3
China Mobile - Detecting Phone-Based Fraud by Analyzing Network or Graph Pattern Features
• Each phone node has a fraud flag, indicating whether it’s a good phone or a bad phone and what type of fraud: scam, harassment, advertisement
• Run real-time GSQL query for each call:
○ Collect 118 features
○ Compute composite score
○ Update fraud flag
○ Return fraud type
Diagram: a real-time call event (caller, callee, time) triggers the query, which returns the fraud type; call detail records (caller, callee, time, duration) feed the continuous graph update.
Phone Fraud Real-Time Detection System
phone vertex
- fraud flag
- expiration time
phone_phone edge (source to target1 through target4 in the diagram)
- num of call
- total duration
- call date list
- num of rejection
● 600 Million Vertices
● 15+ Billion Edges
● 300 Million Daily Updates
Case 1: Call type was recently flagged
Diagram elements: real-time call event (call time, caller ID, callee ID); check whether the caller was recently flagged as “bad”; real-time query that collects the caller’s graph features; classifier; update if the caller is classified as “bad”.
Case 2: Call needs to be classified
Diagram elements: the same as in Case 1.
Input: list of calls with phone pairs and call time (batch)
Output: 1. Call fraud type; 2. Scoring and feature vector of fraud calls as supporting evidence (Explainable AI)
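One way to read the two cases is as a cache check followed by a classification. The sketch below is plain Python, not the GSQL query from the talk. It assumes the "expiration time" attribute on the phone vertex bounds how long a stored flag is trusted; `FLAG_TTL_SECONDS`, `handle_call`, and the stub feature and classifier functions are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

FLAG_TTL_SECONDS = 3600  # hypothetical lifetime of a stored "bad" flag


@dataclass
class PhoneVertex:
    fraud_flag: Optional[str] = None   # e.g. "scam", "harassment", "advertisement"
    expiration_time: float = 0.0       # the flag is ignored from this time on


def handle_call(
    phones: Dict[str, PhoneVertex],
    caller: str,
    call_time: float,
    collect_features: Callable[[str], Dict[str, float]],
    classify: Callable[[Dict[str, float]], Optional[str]],
) -> Optional[str]:
    """Return the fraud type for this call, or None if the caller looks fine."""
    vertex = phones.setdefault(caller, PhoneVertex())

    # Case 1: the caller was recently flagged, so reuse the stored flag.
    if vertex.fraud_flag is not None and call_time < vertex.expiration_time:
        return vertex.fraud_flag

    # Case 2: collect the caller's graph features and run the classifier.
    fraud_type = classify(collect_features(caller))

    # Update the vertex so later calls can take the Case 1 shortcut.
    vertex.fraud_flag = fraud_type
    vertex.expiration_time = call_time + FLAG_TTL_SECONDS if fraud_type else 0.0
    return fraud_type


# Stub feature collection and classifier, for illustration only.
calls_classified = []

def fake_features(phone):
    calls_classified.append(phone)
    return {"rejected_calls": 40.0 if phone == "p1" else 0.0}

def fake_classifier(features):
    return "scam" if features["rejected_calls"] > 10 else None

phones = {}
first = handle_call(phones, "p1", 0.0, fake_features, fake_classifier)    # Case 2
second = handle_call(phones, "p1", 10.0, fake_features, fake_classifier)  # Case 1
```

In this sketch the second call never reaches the classifier, which is the point of the shortcut: feature collection is only paid for when there is no fresh flag.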
China Mobile Machine Learning Workflow
1. Data labels from police reports and online third-party sources
2. A total of 118 graph features analyzed to build fraud detection model
3. All 118 graph features collected by one GSQL query
4. Training data’s features collected in GSQL in batch processing and stored as CSV file for future model training
5. TigerGraph performs fraud scoring with multiple Machine Learning models in real-time
6. Machine Learning models are trained offline and model parameters stored as configuration files for GSQL to use for real-time scoring
(Future: Training ML models in Spark)
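Steps 4 and 6 of the workflow can be sketched in a few lines of Python. This is an illustration under several assumptions: the three feature names, the JSON config format, the parameter values, and the choice of a logistic model are all made up here, since the talk names neither the features in full nor the model types.

```python
import csv
import io
import json
import math

# Hypothetical feature names; the talk mentions 118 features without listing them all.
FEATURES = ["callback_ratio", "rejected_calls", "stable_group_size"]


def write_training_csv(rows, out):
    """Write labelled feature rows as CSV for offline training (step 4)."""
    writer = csv.DictWriter(out, fieldnames=["phone", *FEATURES, "label"])
    writer.writeheader()
    writer.writerows(rows)


def load_model(config_text):
    """Parse model parameters from the text of a config file (step 6)."""
    config = json.loads(config_text)
    return config["bias"], config["weights"]


def score(features, bias, weights):
    """Logistic score in (0, 1); higher means more likely fraud."""
    z = bias + sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


rows = [
    {"phone": "p1", "callback_ratio": 0.0, "rejected_calls": 40,
     "stable_group_size": 0, "label": 1},
    {"phone": "p2", "callback_ratio": 0.8, "rejected_calls": 1,
     "stable_group_size": 6, "label": 0},
]
buffer = io.StringIO()          # stands in for the CSV file on disk
write_training_csv(rows, buffer)

# Parameters as they might come out of offline training (made-up values).
config_text = json.dumps({
    "bias": -1.0,
    "weights": {"callback_ratio": -2.0, "rejected_calls": 0.1,
                "stable_group_size": -0.3},
})
bias, weights = load_model(config_text)
scores = {row["phone"]: score(row, bias, weights) for row in rows}
print(scores)
```

Separating training (offline, on the CSV) from scoring (online, from stored parameters) means the real-time path only needs a handful of multiplications per model.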
Machine Learning with TigerGraph
Real-time Scoring with Multiple ML models in GSQL
Fast: Real-time response for both feature collection and scoring
Efficient: Aggregation during traversal - multiple features in one
Easy: Collect complex features without multiple RDBMS joins
China Mobile Anti-Fraud Results from TigerGraph Machine Learning Solutions
• 3.2 million fraud notifications in Shandong Province (Dec 2016 to July 2019)
• Potential loss saved: ~39.86 million RMB (~6 million US dollars)
Why TigerGraph + Spark For Machine Learning?
● Parallel processing, distributed systems in training, ETL & feature collection - AT SCALE!
● Enrich machine learning with complex graph features - AT SCALE!
● Capture business moments with real-time response with explainable AI - AT SCALE!
Spark and TigerGraph Data Pipeline
Diagram labels: Static Data Sources, Streaming Data Sources, TigerGraph JDBC Driver
JDBC Driver (v1.2)
● Type 4 driver
● Supports bi-directional data flow (read and write) with TigerGraph
● Read: Converts ResultSet to DataFrame
● Write: Loads DataFrames and files to vertices/edges in TigerGraph
● Supports REST endpoints of built-in, compiled and interpreted GSQL queries from TigerGraph
● Open Source: https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
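A read through the driver from PySpark might look roughly like the sketch below. Treat every option key and value format as an assumption to verify against the driver README linked above: the driver class name, the `jdbc:tg:http://` URL scheme, port 14240, the `graph` option, and the `dbtable` syntax are recalled conventions, not confirmed by the slides. The graph name `phone_graph` and the helper names are hypothetical. Only `spark.read.format("jdbc").options(...).load()` is standard Spark API.

```python
def tigergraph_jdbc_options(host, graph, user, password, source, port=14240):
    """Build the option map for a Spark JDBC read against TigerGraph.

    All keys and value formats are assumptions to check against the
    driver README; they are not taken from the slides.
    """
    return {
        "driver": "com.tigergraph.jdbc.Driver",   # assumed driver class name
        "url": f"jdbc:tg:http://{host}:{port}",   # assumed URL scheme and port
        "user": user,
        "password": password,
        "graph": graph,
        "dbtable": source,   # assumed form, e.g. "vertex phone" or "query <name>(<args>)"
    }


def read_graph_features(spark, options):
    """Load the result set as a Spark DataFrame (spark is a SparkSession)."""
    return spark.read.format("jdbc").options(**options).load()


opts = tigergraph_jdbc_options(
    host="127.0.0.1", graph="phone_graph", user="tigergraph",
    password="change-me", source="vertex phone",
)
# Requires a live Spark session, the driver JAR on the classpath, and a
# reachable TigerGraph instance:
# features_df = read_graph_features(spark, opts)
```

Once the features are in a DataFrame, model training can proceed with ordinary Spark tooling.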
DEMO
Graph Feature Extraction from TigerGraph to Spark via TigerGraph’s JDBC Driver
Graph Features: Stable Group & InGroup Connection
• Stable Group: phones in the target group that have regular calls (stable connection) with source phone
• Stable InGroup Connections: phones in the target group that have regular calls (stable connection) among themselves
Stable Connection defined as
● Has both call and callback
● Num of calls is larger than a given limit
● Total duration is larger than a given limit
Resources
• TigerGraph Cloud Machine Learning Starter Kit
a. Register at tgcloud.us
• JDBC Driver (Open Source)
a. https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
• Contact me at emma.liu@tigergraph.com
More … TigerGraph & Neural Network
Training data: https://www.coursera.org/learn/machine-learning
Watch Graph Gurus Episode 19: https://info.tigergraph.com/graph-gurus-19
Contact me: emma.liu@tigergraph.com
“Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
Real-time deep-link graph analytics at scale is the differentiator for your machine learning pipeline!
Backup Slides
Stable Group Pseudocode
Step 1: start from a given phone vertex, find its 1-step neighbors
Step 2: check if a target has both stable outgoing (phone_phone) and stable incoming edges (phone_phone_reversed)
Diagram: source connected to target1 through target4 by phone_phone and phone_phone_reversed edges; each edge carries num of call, total duration, call date list, num of rejection
Stable Connection defined as
● Has both call and callback
● Num of calls is larger than a given limit
● Total duration is larger than a given limit
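The two steps can be sketched in Python over a small dictionary of directed call edges. This is not the GSQL from the talk. It assumes the call-count and duration limits apply to each direction separately, and it models the reversed edge as a lookup of the opposite direction; `CallEdge`, `is_stable_edge`, and `stable_group` are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional, Set, Tuple


class CallEdge(NamedTuple):
    num_calls: int
    total_duration: int  # seconds


# Directed call edges: (caller, callee) -> aggregated attributes.
Edges = Dict[Tuple[str, str], CallEdge]


def is_stable_edge(edge: Optional[CallEdge], min_calls: int, min_duration: int) -> bool:
    """An edge is stable if it exists and both totals exceed the given limits."""
    return (
        edge is not None
        and edge.num_calls > min_calls
        and edge.total_duration > min_duration
    )


def stable_group(edges: Edges, source: str, min_calls: int, min_duration: int) -> Set[str]:
    """Phones with a stable outgoing edge from and a stable incoming edge to `source`."""
    # Step 1: 1-step neighbors reached by outgoing calls from the source.
    neighbors = {callee for (caller, callee) in edges if caller == source}
    # Step 2: keep targets with both a stable call and a stable callback.
    return {
        t for t in neighbors
        if is_stable_edge(edges.get((source, t)), min_calls, min_duration)
        and is_stable_edge(edges.get((t, source)), min_calls, min_duration)
    }


edges = {
    ("s", "t1"): CallEdge(5, 600), ("t1", "s"): CallEdge(4, 500),  # stable both ways
    ("s", "t2"): CallEdge(9, 900),                                  # no callback
    ("s", "t3"): CallEdge(1, 20), ("t3", "s"): CallEdge(6, 700),    # too few outgoing calls
}
group = stable_group(edges, "s", min_calls=2, min_duration=120)
print(group)
```

An empty result for a phone corresponds to the "empty stable group" feature listed among the bad-phone examples.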
Stable InGroup Connections Pseudocode
Step 1: starting from a given phone vertex, find its 1-step neighbors (target group)
Step 2: for each vertex in the target group, find its 1-step neighbors and check for stable connections
Step 3: for each vertex in the target group, check whether its stable targets are themselves in the target group
Diagram: source connected to target1 through target4 by phone_phone and phone_phone_reversed edges; each edge carries num of call, total duration, call date list, num of rejection
Stable Connection defined as
● Has both call and callback
● Num of calls is larger than a given limit
● Total duration is larger than a given limit
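A Python sketch of the three steps follows, again over a dictionary of directed edges rather than in GSQL. It assumes a stable connection needs both directions to exceed the limits, and it reads the output as the set of group members with at least one stable connection to another member; the function names and sample data are hypothetical.

```python
from typing import Dict, Set, Tuple

# (caller, callee) -> (number of calls, total duration in seconds)
Edges = Dict[Tuple[str, str], Tuple[int, int]]


def stable(edges: Edges, a: str, b: str, min_calls: int, min_duration: int) -> bool:
    """True if a->b and b->a both exist and both exceed the limits."""
    for pair in ((a, b), (b, a)):
        if pair not in edges:
            return False
        calls, duration = edges[pair]
        if calls <= min_calls or duration <= min_duration:
            return False
    return True


def stable_in_group(edges: Edges, source: str, min_calls: int, min_duration: int) -> Set[str]:
    """Group members that have a stable connection with another group member."""
    # Step 1: the target group is the source's 1-step neighbors.
    group = {callee for (caller, callee) in edges if caller == source}
    result = set()
    for member in group:
        # Step 2: the member's own 1-step neighbors with a stable connection.
        stable_targets = {
            callee for (caller, callee) in edges
            if caller == member and stable(edges, member, callee, min_calls, min_duration)
        }
        # Step 3: keep the member if one of its stable targets is also in the group.
        if stable_targets & (group - {member}):
            result.add(member)
    return result


edges = {
    ("s", "t1"): (3, 300), ("s", "t2"): (3, 300), ("s", "t3"): (3, 300),
    ("t1", "t2"): (5, 600), ("t2", "t1"): (4, 400),   # stable pair inside the group
    ("t3", "x"): (9, 900), ("x", "t3"): (9, 900),     # stable, but x is outside the group
}
in_group = stable_in_group(edges, "s", min_calls=2, min_duration=120)
print(in_group)
```

Here t1 and t2 qualify because they call each other regularly, while t3 does not because its only stable contact is outside the source's target group.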

More Related Content

What's hot

Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 

What's hot (20)

Use case and integration of ClickHouse with Apache Superset & Dremio
Use case and integration of ClickHouse with Apache Superset & DremioUse case and integration of ClickHouse with Apache Superset & Dremio
Use case and integration of ClickHouse with Apache Superset & Dremio
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Real-time personalization at scale by Salesforce CDP and Interaction Studio, ...
Real-time personalization at scale by Salesforce CDP and Interaction Studio, ...Real-time personalization at scale by Salesforce CDP and Interaction Studio, ...
Real-time personalization at scale by Salesforce CDP and Interaction Studio, ...
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Splunk Tutorial for Beginners - What is Splunk | Edureka
Splunk Tutorial for Beginners - What is Splunk | EdurekaSplunk Tutorial for Beginners - What is Splunk | Edureka
Splunk Tutorial for Beginners - What is Splunk | Edureka
 
Big Data Hadoop Customer 360 Degree View
Big Data Hadoop Customer 360 Degree ViewBig Data Hadoop Customer 360 Degree View
Big Data Hadoop Customer 360 Degree View
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
Replicate Salesforce Data in Real Time with Change Data Capture
Replicate Salesforce Data in Real Time with Change Data CaptureReplicate Salesforce Data in Real Time with Change Data Capture
Replicate Salesforce Data in Real Time with Change Data Capture
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfPrometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
 
Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
Neo4j 4.1 overview
Neo4j 4.1 overviewNeo4j 4.1 overview
Neo4j 4.1 overview
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 

Similar to Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Analytics with Spark AI

How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
Connected Data World
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
PyData
 

Similar to Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Analytics with Spark AI (20)

Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AIGraph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analytics
 
TigerGraph UI Toolkits Financial Crimes
TigerGraph UI Toolkits Financial CrimesTigerGraph UI Toolkits Financial Crimes
TigerGraph UI Toolkits Financial Crimes
 
Graph Gurus Episode 34: Graph Databases are Changing the Fraud Detection and ...
Graph Gurus Episode 34: Graph Databases are Changing the Fraud Detection and ...Graph Gurus Episode 34: Graph Databases are Changing the Fraud Detection and ...
Graph Gurus Episode 34: Graph Databases are Changing the Fraud Detection and ...
 
Fraud prevention is better with TigerGraph inside
Fraud prevention is better with  TigerGraph insideFraud prevention is better with  TigerGraph inside
Fraud prevention is better with TigerGraph inside
 
Graph Gurus Episode 3: Anti Fraud and AML Part 1
Graph Gurus Episode 3: Anti Fraud and AML Part 1Graph Gurus Episode 3: Anti Fraud and AML Part 1
Graph Gurus Episode 3: Anti Fraud and AML Part 1
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018
 
High availability, real-time and scalable architectures
High availability, real-time and scalable architecturesHigh availability, real-time and scalable architectures
High availability, real-time and scalable architectures
 
Shift Remote: AI: Smarter AI with analytical graph databases - Victor Lee (Ti...
Shift Remote: AI: Smarter AI with analytical graph databases - Victor Lee (Ti...Shift Remote: AI: Smarter AI with analytical graph databases - Victor Lee (Ti...
Shift Remote: AI: Smarter AI with analytical graph databases - Victor Lee (Ti...
 
Graph+AI for Fin. Services
Graph+AI for Fin. ServicesGraph+AI for Fin. Services
Graph+AI for Fin. Services
 
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta CachingReal-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
Detecting Fraud and AML Violations In Real-Time for Banking, Telecom and eCom...
Detecting Fraud and AML Violations In Real-Time for Banking, Telecom and eCom...Detecting Fraud and AML Violations In Real-Time for Banking, Telecom and eCom...
Detecting Fraud and AML Violations In Real-Time for Banking, Telecom and eCom...
 
Scaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & ExpertsScaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & Experts
 
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
How Graphs Continue to Revolutionize The Prevention of Financial Crime & Frau...
 
Analytics in Your Enterprise
Analytics in Your EnterpriseAnalytics in Your Enterprise
Analytics in Your Enterprise
 
Smarter Event-Driven Edge with Amazon SageMaker & Project Flogo (AIM204-S) - ...
Smarter Event-Driven Edge with Amazon SageMaker & Project Flogo (AIM204-S) - ...Smarter Event-Driven Edge with Amazon SageMaker & Project Flogo (AIM204-S) - ...
Smarter Event-Driven Edge with Amazon SageMaker & Project Flogo (AIM204-S) - ...
 
big-data-anallytics.pptx
big-data-anallytics.pptxbig-data-anallytics.pptx
big-data-anallytics.pptx
 
Graph Gurus 24: How to Build Innovative Applications with TigerGraph Cloud
Graph Gurus 24: How to Build Innovative Applications with TigerGraph CloudGraph Gurus 24: How to Build Innovative Applications with TigerGraph Cloud
Graph Gurus 24: How to Build Innovative Applications with TigerGraph Cloud
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Uber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis ReportUber Ride Supply Demand Gap Analysis Report
Uber Ride Supply Demand Gap Analysis Report
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 

Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Analytics with Spark AI

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Benyue (Emma) Liu, TigerGraph Inc. Real-time Fraud Detection at Scale - Integrating Real-Time Deep-Link Graph Analytics with Spark AI #UnifiedDataAnalytics #SparkAISummit
  • 3. Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
  • 4. Graph is HOW WE THINK 4#UnifiedDataAnalytics #SparkAISummit
  • 5. Common TigerGraph Use Cases 5 Improve Operational EfficiencyReduce Costs & Manage RisksIncrease Revenue • Recommendation Engine • Real-time Customer 360/ MDM • Product & Service Marketing • Fraud Detection • Anti-Money Laundering (AML) • Risk Assessment & Monitoring • Cyber Security • Enterprise Knowledge Graph • Network, IT and Cloud Resource Optimization • Energy Management System • Supply Chain Analysis Analyze all interactions in real-time to sell more Reduce costs and assess and monitor risks effectively Manage resources for maximum output Foundational Use Cases: Geospatial Analysis, Time Series Analysis, AI and Machine Learning
  • 6. 7 Key Data Science Capabilities Powered By a Native Parallel Graph Deep Link Analysis Relational Commonality Discovery and Computation From a set of entities (e.g. devices, customers, accounts, doctors), show all links or connections Given 2 entities (e.g. customers, businesses), follow their relationships to find commonality 6 Multi-dimensional Entity & Pattern Matching Given a pattern (e.g. referring business to a relative), find similar patterns in the graph Hub & Community Detection Find most influential members of a group (customers, doctors, citizens) & detect community around them Community 1 Community 2 1 32 4 5 Geospatial Graph Analysis Analyze changes in entities & relationships with location data A C A B Machine Learning Feature Generation & Explainable AI Extract graph-based features to feed as training data for machine learning; Power Explainable AI7 Temporal (Time-Series) Graph Analysis Analyze changes in entities & relationships over time Query Pattern P MatchB D
  • 7. Power Explainable AI with TigerGraph
  • 8. Why Spark + TigerGraph?
  • 9. Spark + TigerGraph Data Pipeline
  • 10. Typical Spark + TigerGraph Integration
      ● Data Preparation and Integration (TigerGraph/Spark)
      ● Unsupervised Learning (TigerGraph)
      ● Feature Extraction for Supervised Learning (TigerGraph/Spark)
      ● Model Training (Spark)
      ● Validate and Apply Model (TigerGraph)
      ● Visualize and Explore Interconnected Data (TigerGraph)
  • 11. Machine Learning with TigerGraph: China Mobile Anti-Fraud/Scam Detection
  • 12. Real-Time Phone-Based Fraud Detection
      Massive, Worldwide Problem
      ● 18 billion robocalls in the US in 2017 (hiya.com)
      ● Spam/Scam - agile, spoofed numbers
      Customer
      ● 600M subscribers
      ● 300M calls/day, peak 10K calls/sec
      ● Need: Real-time detection of various types of phone-based fraud
  • 13. Real-Time Phone Anti-Spam/Scam Detection
      TigerGraph Solution: Real-time graph-based machine learning and decision system
      Graph Analytics
      ● Real-time machine learning
          ○ 118 graph features per call
          ○ Retrained periodically with 2M calls
      ● Real-time decisions
          ○ Call recipient sees alert if ML system says call is suspicious
      ● In production since Dec 2016
      Graph Database
      ● 600M phone numbers (inside and outside network)
      ● 15B phone-phone call edges (2-month sliding window)
          ○ Time
          ○ Duration
      ● Real-time graph updates, peak 10K+ calls/sec
      ● 118 graph features per phone
  • 14. Examples of Graph Features for Machine Learning
      Good Phone Features
      ● (1) High call back phone
      ● (2) Stable group
      ● (3) Long term phone
      ● (4) Many in-group connections
      ● (5) 3-step friend relation
      Bad Phone Features
      ● (1) Short term call duration
      ● (2) Empty stable group
      ● (3) No call back phone
      ● (4) Many rejected calls
      ● (5) Average distance > 3
  • 15. China Mobile - Detecting Phone-Based Fraud by Analyzing Network or Graph Pattern Features
      ● Each phone node has a fraud flag indicating whether it is a good phone or a bad phone and, for a bad phone, the type of fraud: scam, harassment, or advertisement
      ● Run a real-time GSQL query for each call:
          ○ Collect 118 features
          ○ Compute composite score
          ○ Update fraud flag
          ○ Return fraud type
      [Diagram labels: Real-Time Call Event (Caller, Callee, Time); Call Detail Records (Caller, Callee, Time, Duration); Query; Continuous Graph Update; Fraud Type]
  • 16. Phone Fraud Real-Time Detection System
      ● phone vertex attributes: fraud flag, expiration time
      ● phone_phone edge attributes: num of call, total duration, call date list, num of rejection
      ● 600 Million Vertices
      ● 15+ Billion Edges
      ● 300 Million Daily Updates
  • 17. Case 1: Call type was recently flagged
      [Flow diagram labels: Real-time Call Event (Call Time, Caller ID, Callee ID); If caller was recently flagged as “bad”; Real-time Query: Collect Caller’s Graph Features; Classifier; If Caller is classified as “bad”; Update]
  • 18. Case 2: Call needs to be classified
      [Same flow diagram as Case 1]
      ● Input: list of calls with phone pairs and call time (batch)
      ● Output:
          ○ 1. Call fraud type
          ○ 2. Scoring and feature vector of fraud calls for supporting evidence (Explainable AI)
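The two cases can be read as a single dispatch routine. The following Python sketch is one possible interpretation, not the production GSQL query: the names `handle_call`, `FraudFlag` and `FLAG_TTL_SECONDS` are hypothetical, the one-hour flag lifetime is invented for illustration, and the feature collector and classifier are passed in as plain functions. It assumes that a "recent" flag is one whose expiration time has not yet passed.

```python
from typing import Callable, Dict, NamedTuple, Optional, Tuple

FLAG_TTL_SECONDS = 3600.0  # hypothetical lifetime of a fraud flag


class FraudFlag(NamedTuple):
    fraud_type: str    # e.g. "scam", "harassment", "advertisement"
    expires_at: float  # the flag counts as "recent" before this time


def handle_call(
    caller: str,
    call_time: float,
    flags: Dict[str, FraudFlag],
    collect_features: Callable[[str], Dict[str, float]],
    classify: Callable[[Dict[str, float]], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (fraud type or None, which case handled the call)."""
    flag = flags.get(caller)
    # Case 1: the caller was recently flagged, so reuse the stored fraud type.
    if flag is not None and call_time < flag.expires_at:
        return flag.fraud_type, "case 1"
    # Case 2: collect the caller's graph features and run the classifier.
    features = collect_features(caller)
    fraud_type = classify(features)
    if fraud_type is not None:
        # Update the flag so that later calls take the Case 1 shortcut.
        flags[caller] = FraudFlag(fraud_type, call_time + FLAG_TTL_SECONDS)
    return fraud_type, "case 2"
```

In this reading, Case 1 avoids repeating the feature collection for a caller whose flag is still valid, while Case 2 pays for the full query and refreshes the flag.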
  • 19. China Mobile Machine Learning Workflow
      1. Data labels from police reports and online third-party sources
      2. A total of 118 graph features analyzed to build fraud detection model
      3. All 118 graph features collected by one GSQL query
      4. Training data’s features collected in GSQL in batch processing and stored as CSV file for future model training
      5. TigerGraph performs fraud scoring with multiple Machine Learning models in real-time
      6. Machine Learning models are trained offline and model parameters stored as configuration files for GSQL to use for real-time scoring (Future: Training ML models in Spark)
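As an illustration of steps 5 and 6, the Python sketch below loads model parameters from a configuration document and scores one feature vector with several models. Everything specific here is an assumption: the slides do not say which model family is used or how the configuration files are laid out, so the JSON layout, the logistic-regression form, the feature names, the weights and the 0.5 threshold are all hypothetical.

```python
import json
import math
from typing import Dict, Optional, Tuple

# Hypothetical configuration: one logistic-regression model per fraud type.
CONFIG_JSON = """
{
  "threshold": 0.5,
  "models": {
    "scam": {"bias": -1.0,
             "weights": {"rejected_calls": 1.0, "stable_group_size": -0.5}},
    "advertisement": {"bias": -2.0,
                      "weights": {"rejected_calls": 0.5}}
  }
}
"""


def score(model: dict, features: Dict[str, float]) -> float:
    """Logistic score in (0, 1) from stored bias and weights."""
    z = model["bias"] + sum(
        weight * features.get(name, 0.0)
        for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def classify(
    config: dict, features: Dict[str, float]
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return (best fraud type or None, per-model scores as evidence)."""
    scores = {name: score(m, features) for name, m in config["models"].items()}
    best = max(scores, key=scores.get)
    fraud_type = best if scores[best] >= config["threshold"] else None
    return fraud_type, scores


CONFIG = json.loads(CONFIG_JSON)
```

Returning the per-model scores next to the decision mirrors the "scoring and feature vector as supporting evidence" output described for the real system.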
  • 20. Machine Learning with TigerGraph: Real-time Scoring with Multiple ML models in GSQL
      ● Fast: Real-time response for both feature collection and scoring
      ● Efficient: Aggregation during traversal - multiple features in one
      ● Easy: Collect complex features without multiple RDBMS joins
  • 21. China Mobile Anti-Fraud Results from TigerGraph Machine Learning Solutions
      ● 3.2 million fraud notifications in Shandong Province (Dec 2016 – July 2019)
      ● Potential losses saved: ~39.86 million RMB (~6 million US dollars)
  • 22. Why Spark + TigerGraph?
  • 23. Why TigerGraph + Spark For Machine Learning?
      ● Parallel processing, distributed systems in training, ETL & feature collections AT SCALE!
      ● Capture business moments with real-time response with explainable AI AT SCALE!
      ● Enrich machine learning with complex graph features AT SCALE!
  • 24. Spark and TigerGraph Data Pipeline
      [Diagram labels: Static Data Sources; Streaming Data Sources; TigerGraph JDBC Driver]
  • 25. JDBC Driver (v1.2)
      ● Type 4 driver
      ● Supports bi-directional data flow (read and write) to TigerGraph
      ● Read: Converts ResultSet to DataFrame
      ● Write: Loads DataFrame and files to vertex/edge in TigerGraph
      ● Supports REST endpoints of built-in, compiled and interpreted GSQL queries from TigerGraph
      ● Open Source: https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
  • 26. Demo: Graph Feature Extraction from TigerGraph to Spark via TigerGraph’s JDBC Driver
  • 28. Graph Features: Stable Group & InGroup Connection
      ● Stable Group: phones in the target group that have regular calls (stable connection) with source phone
      ● Stable InGroup Connections: phones in the target group that have regular calls (stable connection) among themselves
      Stable Connection defined as
      ● Has both call and callback
      ● Num of calls is larger than a given limit
      ● Total duration is larger than a given limit
  • 29. Resources
      ● TigerGraph Cloud Machine Learning Starter Kit: register at tgcloud.us
      ● JDBC Driver (Open Source): https://github.com/tigergraph/ecosys/tree/master/etl/tg-jdbc-driver
      ● Contact me at emma.liu@tigergraph.com
  • 30. More … TigerGraph & Neural Network
      ● Training data: https://www.coursera.org/learn/machine-learning
      ● Watch Graph Guru Episode 19: https://info.tigergraph.com/graph-gurus-19
      ● Contact me: emma.liu@tigergraph.com
  • 31. “Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
      Real-time deep-link graph analytics at scale is the differentiator for your machine learning pipeline!
  • 34. Stable Group Pseudocode
      Step 1: start from a given phone vertex, find its 1-step neighbors
      Step 2: check if a target has both stable outgoing (phone_phone) and stable incoming edges (phone_phone_reversed)
      Stable Connection defined as
      ● Has both call and callback
      ● Num of calls is larger than a given limit
      ● Total duration is larger than a given limit
      [Diagram labels: source; target1 to target4; edges phone_phone and phone_phone_reversed; attributes num of call, total duration, call date list, num of rejection]
  • 35. Stable InGroup Connections Pseudocode
      Step 1: starting from a given phone vertex, find its 1-step neighbors (target group)
      Step 2: for each vertex in the target group, find its 1-step neighbors and check for stable connections
      Step 3: check the stable target for each vertex in the target group
      Stable Connection defined as
      ● Has both call and callback
      ● Num of calls is larger than a given limit
      ● Total duration is larger than a given limit
      [Diagram labels: source; target1 to target4; edges phone_phone and phone_phone_reversed; attributes num of call, total duration, call date list, num of rejection]
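The two pseudocode slides can be sketched in plain Python over an in-memory edge map. This is a small illustration, not the GSQL used in production. The assumptions are: directed call edges are stored as a dict keyed by (caller, callee), so the reverse key plays the role of phone_phone_reversed; the limits `MIN_CALLS` and `MIN_DURATION` are made-up values; the limits are applied to each direction separately; and Step 3 is read as keeping only stable pairs whose both ends lie in the target group.

```python
from typing import Dict, FrozenSet, NamedTuple, Set, Tuple


class EdgeStats(NamedTuple):
    num_calls: int
    total_duration: float  # seconds


Edges = Dict[Tuple[str, str], EdgeStats]  # (caller, callee) -> statistics

MIN_CALLS = 3         # hypothetical "given limit" on number of calls
MIN_DURATION = 60.0   # hypothetical "given limit" on total duration


def _direction_is_stable(edges: Edges, a: str, b: str) -> bool:
    stats = edges.get((a, b))
    return (stats is not None and stats.num_calls > MIN_CALLS
            and stats.total_duration > MIN_DURATION)


def is_stable_connection(edges: Edges, a: str, b: str) -> bool:
    """Both call and callback exist, and each direction exceeds the limits."""
    return _direction_is_stable(edges, a, b) and _direction_is_stable(edges, b, a)


def neighbors(edges: Edges, phone: str) -> Set[str]:
    """1-step neighbors, following edges in either direction."""
    found = set()
    for caller, callee in edges:
        if caller == phone:
            found.add(callee)
        elif callee == phone:
            found.add(caller)
    found.discard(phone)
    return found


def stable_group(edges: Edges, source: str) -> Set[str]:
    """Targets that have a stable connection with the source phone."""
    return {t for t in neighbors(edges, source)
            if is_stable_connection(edges, source, t)}


def stable_in_group_connections(edges: Edges, source: str) -> Set[FrozenSet[str]]:
    """Unordered pairs of targets with a stable connection among themselves."""
    group = neighbors(edges, source)
    pairs = set()
    for u in group:
        for v in neighbors(edges, u):
            if v in group and v != u and is_stable_connection(edges, u, v):
                pairs.add(frozenset((u, v)))
    return pairs


# Made-up example data: source "S" with targets "T1" to "T4".
EDGES: Edges = {
    ("S", "T1"): EdgeStats(5, 300.0), ("T1", "S"): EdgeStats(4, 200.0),
    ("S", "T2"): EdgeStats(10, 500.0),                      # no callback
    ("S", "T3"): EdgeStats(5, 100.0), ("T3", "S"): EdgeStats(1, 20.0),
    ("S", "T4"): EdgeStats(7, 350.0), ("T4", "S"): EdgeStats(6, 400.0),
    ("T1", "T4"): EdgeStats(4, 120.0), ("T4", "T1"): EdgeStats(5, 90.0),
    ("T2", "T3"): EdgeStats(8, 300.0),                      # one-way only
}
```

With this example data, T1 and T4 form the stable group of S, and the only stable in-group connection is the T1 and T4 pair. The size of each result is the kind of value that could serve as a graph feature for the classifier.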