Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money Laundering Rings
Sept 30, 2020
Victor Lee, TigerGraph
Kumar Deepak, Xilinx
 
| GRAPHAIWORLD.COM | #GRAPHAIWORLD |
Our Presenters
Victor Lee
Head of Product Strategy & Developer Relations
● BS in Electrical Engineering and Computer Science from UC Berkeley, MS in Electrical Engineering from Stanford University
● PhD in Computer Science from Kent State University, focused on graph data mining
● 20+ years in the tech industry
Kumar Deepak
Distinguished Engineer
● BS in Electronics and Communication Engineering from Indian Institute of Technology, Kharagpur
● Leads Xilinx engineering efforts to accelerate database and analytics
● 20+ years of experience in architecting and developing large-scale complex software and hardware systems
 
Agenda
● How graph analytics provides better and faster insights
● How FPGAs amplify the speed and value of analytics
● Use Case: Fraud Detection and Money Laundering
  - Finding connected communities for fraud detection
● How FPGAs work
● Louvain Modularity running on an FPGA
● Benchmark
 
Graph-Powered Analytics & Machine Learning
Richer Data
● Relationships are 1st Class Citizens
● Connects different datasets and silos
Deeper Questions
● Look for semantic patterns of relationships
● Search far and wide more easily
Additional Computational Options
● Graph algorithms
● Graph-enhanced machine learning
Explainable Results
● Semantic data model, queries, and answers
● Visual exploration and results
 
The TigerGraph Difference
| Feature | Design Difference | Benefit |
|---|---|---|
| Real-Time Deep-Link Querying (5 to 10+ hops deep) | Native Graph design; C++ engine, for high performance; Storage Architecture | Uncovers hard-to-find patterns; Operational, real-time; HTAP: Transactions+Analytics |
| Handling Massive Scale | Distributed DB architecture; Massively parallel processing; Compressed storage reduces footprint and messaging | Integrates all your data; Automatic partitioning; Elastic scaling of resource usage |
| In-Database Analytics | GSQL: High-level yet Turing-complete language; User-extensible graph algorithm library, runs in-DB; ACID (OLTP) and Accumulators (OLAP) | Avoids transferring data; Richer graph context; In-DB machine learning |
 
TigerGraph Platform: Deploy Anywhere
[Architecture diagram]
● Input data: Operational Data, Master Data, DBs, Spark, Kafka, Files (loaded through the ETL Data Loader)
● Core engines: Graph Storage Engine (GSE) with Graph Data Storage, ID Service, and Indexing; Graph Processing Engine (GPE) with Parallel Query Processing and Data Snapshots; Message Queuing (Spark / Kafka, Zookeeper)
● Access: GSQL Queries (GSQL Server), Visual Design UI (GraphStudio Server), RESTful APIs (RESTPP); user queries and graph algorithms
● Consumers: Business Intelligence, Analytics, Visualization, Dashboards, Reports, Data Warehouses, Master Data Stores, Machine Learning
 
TigerGraph Platform: Deploy Anywhere
[Architecture diagram: the same TigerGraph platform components as on the previous slide, with one addition: a C++ UDF running on Alveo]
 
TigerGraph + XILINX = faster, deeper, and wider insights.
| Vertical Markets | TigerGraph Use Cases | XILINX Acceleration | Customer Benefits |
|---|---|---|---|
| Healthcare | Member Journey / Customer 360 | “Show similar members” via Cosine Similarity: 400X faster on Alveo U50 | $150M/year call center savings |
| Financial Services | Anti-fraud / Anti-Money Laundering | “Show fraud ring activity” via Louvain Community Detection: ~20X faster on Alveo U50 (WIP) | $500M credit card fraud prevention |
| Manufacturing | Supply Chain Optimization | “Balance portfolio forecast”: Soon… | £400M supply chain savings |
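For readers unfamiliar with the "show similar members" measure, here is a minimal sketch of cosine similarity between two feature vectors. It is a plain CPU reference in standard C++, not the Alveo-accelerated implementation; the function name `cosine_similarity` and the choice to return 0 for zero-length vectors are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine similarity between two equal-length feature vectors:
// dot(a, b) / (|a| * |b|). Returns 0.0 when either vector has zero
// magnitude (an illustrative convention, not taken from the slides).
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("vectors must have the same length");
    }
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Each element contributes an independent multiply-accumulate, which is the kind of regular, data-parallel work that maps well to hardware acceleration.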
 
Graph Specific Algorithms + ML
[Figure: graph algorithms and ML techniques, including Clustering, Betweenness, Similarity, Degree, PageRank, Recommend, Shortest Path, Connected, Centrality, Detection, Louvain, Temporal Pattern Detect, Machine Learning, Graph Convolutional Networks (GCN), Dependency Networks (RPN), Markov Networks (RDN), Probabilistic Models (PRM)]
Credit: https://www.geeksforgeeks.org/graph-data-structure-and-algorithms/
 
Fraud Detection with Graph-enhanced ML
● Sophisticated fraud is multi-step, multi-actor, and orchestrated
● Graph algorithms and ML both provide valuable detection and investigative capabilities
 
Graph Algorithms for Fraud Detection
Shortest Path
• Is this entity closely connected to known suspicious/risky entities?
Community Detection
• Narrow the focus of the investigation
• How many high-risk entities are in the community?
Cycle Detection
• Is there a closed loop of related entities where there shouldn’t be (conflicts of interest, etc.)?
• Is there a closed loop in money flow (money laundering)?
Other valuable algorithms: PageRank, Cosine Similarity, etc.
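Two of the questions above can be sketched in a few lines of standard C++. This is an illustrative CPU sketch, not TigerGraph's GSQL library: `hops_to_risky` answers the shortest-path question with a multi-source breadth-first search, and `has_money_loop` answers the closed-loop question with a depth-first search on a directed money-flow graph. Both function names and the adjacency-list `Graph` type are hypothetical.

```cpp
#include <queue>
#include <vector>

// Adjacency list: graph[u] lists the neighbors (or payees) of entity u.
using Graph = std::vector<std::vector<int>>;

// Multi-source BFS: hop distance from every entity to the nearest
// known-risky entity (-1 if unreachable). Assumes an undirected graph,
// i.e. each edge is stored in both directions.
std::vector<int> hops_to_risky(const Graph& graph, const std::vector<int>& risky) {
    std::vector<int> dist(graph.size(), -1);
    std::queue<int> frontier;
    for (int r : risky) {
        if (dist[r] == -1) { dist[r] = 0; frontier.push(r); }
    }
    while (!frontier.empty()) {
        const int u = frontier.front();
        frontier.pop();
        for (int v : graph[u]) {
            if (dist[v] == -1) {
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        }
    }
    return dist;
}

// DFS helper for cycle detection in a directed graph.
// state: 0 = unvisited, 1 = on the current DFS path, 2 = finished.
static bool dfs_finds_cycle(const Graph& graph, int u, std::vector<int>& state) {
    state[u] = 1;
    for (int v : graph[u]) {
        if (state[v] == 1) return true;  // back edge closes a loop
        if (state[v] == 0 && dfs_finds_cycle(graph, v, state)) return true;
    }
    state[u] = 2;
    return false;
}

// True if the directed money-flow graph contains at least one closed loop.
bool has_money_loop(const Graph& graph) {
    std::vector<int> state(graph.size(), 0);
    for (int u = 0; u < static_cast<int>(graph.size()); ++u) {
        if (state[u] == 0 && dfs_finds_cycle(graph, u, state)) return true;
    }
    return false;
}
```

The DFS is recursive for brevity; on very large graphs an explicit stack would avoid deep recursion.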
 
Louvain Modularity Method for Community Detection
● Suppose we partition a graph into communities.
● The modularity score measures how good a particular graph partition is:
  Mod ~ (% of edges that are in-group) minus (expected % of in-group edges, if edges were randomized in a certain way)
● Task: Find the partitioning that has the highest modularity
● Challenge: The number of possible partitionings is exponential
● Solution: Louvain is one of the fastest methods for modularity-based partitioning
[Figure: a first-try partition with score Mod(case 1), and a better partition with score Mod(case 2)]
© Copyright 2020 Xilinx
What is an FPGA (Field Programmable Gate Array)?
• Logic blocks
  • Look-up tables – combinational logic
  • Flip-flops – sequential logic
  • DSP (Digital Signal Processing)
    ‒ Pre-adder, Multiplier, Accumulator
    ‒ AND, OR, NOT, NAND, NOR, XOR, XNOR
    ‒ Pattern Detector
• Writable Memory
  • LUTRAM (Look-up table RAM)
  • BRAM (Block RAM)
  • URAM (Ultra RAM)
• Communication
  • I/O, Transceiver, PCIe, Ethernet
• Programmable Interconnect
Xilinx VU9P FPGA has:
• LUTs: 1.2M
• Flip-Flops: 2.4M
• Writable Memory: 47 MB
• DSP Units: 6800
Credit: https://towardsdatascience.com/introduction-to-fpga-and-its-architecture-20a62c14421c
Configuring an FPGA
[Figure: before configuration, an unprogrammed configuration memory and an unconfigured logic circuit; after configuration, the ‘programmed’ configuration memory and the ‘configured’ logic circuit]
Credit: ‘Bebop to the Boolean Boogie: An Unconventional Guide to Electronics’
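To make the idea of configuration memory concrete, here is a small software model of a look-up table. It is a teaching sketch only: a 4-input LUT is used for brevity, and the names `Lut4`, `kAnd4`, and `kXor4` are hypothetical. The point is that the same element computes a different logic function depending solely on the bits loaded into its configuration word.

```cpp
#include <cstdint>

// A 4-input look-up table modeled as a 16-bit configuration word:
// bit i of `config` is the output for the input combination whose
// binary encoding is i (a = bit 0, b = bit 1, c = bit 2, d = bit 3).
struct Lut4 {
    std::uint16_t config;

    bool eval(bool a, bool b, bool c, bool d) const {
        const unsigned index =
            (a ? 1u : 0u) | (b ? 2u : 0u) | (c ? 4u : 0u) | (d ? 8u : 0u);
        return ((config >> index) & 1u) != 0;
    }
};

// Two example configurations for the same element.
constexpr std::uint16_t kAnd4 = 0x8000;  // output 1 only for input 1111
constexpr std::uint16_t kXor4 = 0x6996;  // output 1 when an odd number of inputs are 1
```

Loading `kAnd4` turns the element into a 4-input AND gate, while loading `kXor4` turns it into a parity (XOR) gate, with no change to the underlying structure.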
Computing Devices
| | CPU | GPU | FPGA | ASIC |
|---|---|---|---|---|
| Example | AMD EPYC 7702 | NVIDIA A100 | Xilinx Alveo U50 | Google TPU |
| Architecture | Instruction Set | Instruction Set | Domain Specific | Domain Specific |
| Purpose | General Purpose | General Purpose | Domain Specific | Domain Specific |
| Workload Types | Serialized Workloads | Parallel Workloads | Any workload | Single Workload |
| Ease of Programming | Easy | Medium | Medium | No programmability |
| Energy Efficiency | Low | Medium | High | Very High |
High-Performance FPGA Applications: Think “Parallel”
˃ Data-level parallelism
• Processing different blocks of a data set in parallel
˃ Task-level parallelism
• Executing different tasks in parallel
• Executing different tasks in a pipelined fashion
˃ Instruction-level parallelism
• Parallel instructions (superscalar)
• Pipelined instructions
˃ Bit-level parallelism
• Custom word width
[Figure: diagram of functions funcA, funcB, funcC, and funcD]
Using C, C++ or OpenCL to Program FPGAs
˃ Xilinx pioneered C to FPGA compilation technology (aka “HLS”) in 2011
˃ No need for low-level hardware description languages
˃ FPGAs are “Software Programmable”
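// Nested simulation loops; the labels name each loop for the HLS tools.
loop_main: for (int j = 0; j < NUM_SIMGROUPS; j += 2) {
    loop_share: for (unsigned int k = 0; k < NUM_SIMS; k++) {
        loop_parallel: for (int i = 0; i < NUM_RNGS; i++) {
            // BOX_MULLER fills num1[i][k] and num2[i][k] for generator i.
            mt_rng[i].BOX_MULLER(&num1[i][k], &num2[i][k], ratio4, ratio3);
            float payoff1 = expf(num1[i][k]) - 1.0f;
            float payoff2 = expf(num2[i][k]) - 1.0f;
            // Positive samples accumulate into the call sums, others into the put sums.
            if (num1[i][k] > 0.0f)
                pCall1[i][k] += payoff1;
            else
                pPut1[i][k] -= payoff1;
            if (num2[i][k] > 0.0f)
                pCall2[i][k] += payoff2;
            else
                pPut2[i][k] -= payoff2;
        }
    }
}
[Figure: the C/C++ source is compiled by the Vitis Compiler (v++) for the FPGA]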
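The fragment above depends on types and constants that are not shown on the slide. As a self-contained illustration of the same "software programmable" style, here is a minimal sketch of a kernel annotated with HLS pragmas. The function `scaled_add` and the constants `kLanes` and `kLength` are hypothetical. `PIPELINE` and `UNROLL` are Vitis HLS pragmas; a standard C++ compiler ignores them (possibly with a warning), so the same source also runs unchanged on a CPU.

```cpp
#include <cstddef>

constexpr std::size_t kLanes = 8;      // hypothetical unroll width
constexpr std::size_t kLength = 1024;  // hypothetical fixed vector length

// Element-wise out = scale * x + y, written in an HLS-friendly style.
// On a CPU the loops simply run sequentially. An HLS tool can treat the
// pragmas as hints: pipeline the outer loop and replicate the inner
// loop body so that kLanes elements are processed in parallel.
void scaled_add(const float* x, const float* y, float scale, float* out) {
    for (std::size_t base = 0; base < kLength; base += kLanes) {
#pragma HLS PIPELINE II=1
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
#pragma HLS UNROLL
            out[base + lane] = scale * x[base + lane] + y[base + lane];
        }
    }
}
```

Fixed loop bounds and independent iterations are what make this shape easy for a compiler to turn into parallel hardware; whether a given tool version achieves the requested initiation interval depends on the memory interfaces, which this sketch does not configure.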
Software Programmability: FPGA Development in C/C++
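[Diagram: an x86 CPU host and an FPGA connected over PCIe]
● x86 CPU: Host Application, Acceleration API, Runtime and Drivers
● FPGA: Accelerated Functions, AXI Interfaces, DMA Engine
● User Application Code: the Host Application (C/C++ code, compiled with GCC) and the Accelerated Functions (synthesizable C/C++, compiled with Vitis)
● Xilinx Acceleration Platform: Acceleration API, Runtime and Drivers, DMA Engine, AXI Interfaces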
Xilinx Accelerated TigerGraph
[Stack diagram, top to bottom]
● TigerGraph: Graph Algorithms and User Defined Functions (UDFs)
● Louvain Modularity (C++)
● Vitis accelerated libraries (BLAS Library)
● Vitis core development kit: compilers, analyzers, debuggers
● Vitis drivers & runtime (XRT)
● Vitis target platforms: U50, U200, U280, U250 (Cloud, On-premise)
Louvain Modularity Algorithm
˃ Modularity Q: a measure of how stable the current clustering is
˃ ΔQ: the change in modularity used to decide whether a vertex should move ("job hop") to another community
˃ Main challenge: integrating large-size variables by scanning the graph as input
The algorithm is like a group of people clustering and then job hopping until stable.
Fig. 2 Parallel Louvain Algorithm flow
Input graph → Phase-1 → Phase-2 → … → Phase-n
Each phase runs Coloring, then Clustering (Get Cid[v{e}] → Find Best Target → Update Cid, TOT, cSize), then Building.
● Coloring
  • Same-color vertices are at distance > 1
  • Coloring vertices can relieve dependencies
  • No need for coloring a small graph
● Clustering
  • Iterate until ΔQ is small enough (Q: modularity)
  • 1 iteration scans all vertices
  • The 1st phase always takes most of the time (>80%)
  • The smaller the graph, the fewer the vertices, the faster the iteration
  • Phase-n: no more clustering happens
● Building
  • Communities are merged into a smaller graph
● >90% of the workload can be accelerated
Legend, For FPGA: (Done), (…)
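The Coloring and Clustering steps of one phase can be sketched as follows. This is a sequential CPU sketch under stated assumptions, not the Xilinx kernel: the graph is undirected, unweighted, and free of self-loops; the function names and the `max_sweeps` cap are hypothetical; and the cSize array from the figure is omitted because the move decision does not need it. Vertices are visited color by color, and since same-color vertices are never adjacent, no vertex reads a neighbor's Cid while that neighbor is being moved within the same color group. A truly parallel version would still have to synchronize the shared TOT updates.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // undirected: each edge stored both ways

// Greedy distance-1 coloring: adjacent vertices never share a color.
std::vector<int> greedy_coloring(const Graph& g) {
    std::vector<int> color(g.size(), -1);
    for (std::size_t v = 0; v < g.size(); ++v) {
        const int degree = static_cast<int>(g[v].size());
        std::vector<bool> used(g[v].size() + 1, false);
        for (int u : g[v]) {
            if (color[u] >= 0 && color[u] <= degree) used[color[u]] = true;
        }
        int c = 0;
        while (used[c]) ++c;  // at most `degree` slots are taken, so one is free
        color[v] = c;
    }
    return color;
}

// Clustering step of one phase: sweep until no vertex moves.
// Returns Cid, the community id of every vertex.
std::vector<int> louvain_phase1(const Graph& g, int max_sweeps = 100) {
    const std::size_t n = g.size();
    double two_m = 0.0;          // sum of degrees = 2 * number of edges
    std::vector<int> cid(n);     // Cid: community of each vertex
    std::vector<double> tot(n);  // TOT: total degree of each community
    for (std::size_t v = 0; v < n; ++v) {
        cid[v] = static_cast<int>(v);
        tot[v] = static_cast<double>(g[v].size());
        two_m += tot[v];
    }
    const std::vector<int> color = greedy_coloring(g);
    int num_colors = 0;
    for (int c : color) if (c + 1 > num_colors) num_colors = c + 1;

    bool moved = true;
    for (int sweep = 0; moved && sweep < max_sweeps; ++sweep) {
        moved = false;
        for (int c = 0; c < num_colors; ++c) {
            for (std::size_t v = 0; v < n; ++v) {
                if (color[v] != c || g[v].empty()) continue;
                const double k_v = static_cast<double>(g[v].size());
                std::map<int, double> links;  // edges from v into each neighboring community
                for (int u : g[v]) links[cid[u]] += 1.0;

                const int old_c = cid[v];
                tot[old_c] -= k_v;  // take v out of its community
                int best_c = old_c;
                // Gain is proportional to dQ: k_in - TOT * k_v / 2m.
                double best_gain = links[old_c] - tot[old_c] * k_v / two_m;
                for (const auto& entry : links) {
                    const double gain = entry.second - tot[entry.first] * k_v / two_m;
                    if (gain > best_gain) { best_gain = gain; best_c = entry.first; }
                }
                tot[best_c] += k_v;  // Update step
                if (best_c != old_c) { cid[v] = best_c; moved = true; }
            }
        }
    }
    return cid;
}
```

On two triangles joined by one bridge edge, this sketch places each triangle in its own community. The Building step, which merges each community into a single vertex for the next phase, is left out to keep the example short.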
Benchmark: Louvain Modularity for a 50M-node network
Xilinx Alveo U50 PCIe Accelerator Card
8GB HBM, 75W
Dataset: europe_osm, Number of vertices: 50912018, Number of edges: 54054660
Nimbix Cloud
Demo: Louvain Modularity for a 50M-node network
Time (seconds) to calculate Louvain Modularity
20x faster than CPU
Using one Alveo U50
 
Thank You!
● Contact Us
○ TigerGraph: Victor Lee, victor@tigergraph.com
○ Xilinx: Dan Eaton, daniele@xilinx.com
Q&A
