Blockchain technology represents a new paradigm for applications in fintech and beyond. It requires new approaches to age-old problems: How do you track criminals, derive BI insights and fight fraud in a blockchain-powered world? While the decentralized nature of the blockchain presents challenges to easily understanding the movement of money, the immutable, shared ledger also represents a tremendous opportunity for behavioral analytics at scale. As blockchain adoption continues to accelerate, organizations big and small need to understand the systems required to efficiently and quickly analyze the wealth of complex information available.
Since 2014, BlockCypher has been on the forefront of blockchain infrastructure. This session will describe how BlockCypher built machine-learning and graph traversal systems on Apache Spark to help both government organizations and private businesses stay informed in the brave new world of blockchain technology. The speakers will share their experiences and the lessons they learned combining these two bleeding-edge technologies, and how these techniques can be applied to private and federated chains.
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for Forensics, Fraud, and Finance with Karen Hsu
1. Spark, GraphX, and blockchains: Building a
behavioral analytics platform for forensics,
fraud, and finance
2. Agenda
● Why the blockchain and analytics (business)
● Blockchain data complexity
● Existing challenges
● New approach
● What to look for
● Use cases
6. BlockCypher Blockchain Web Services
IDENTITY
Data Endpoint
Webhooks Websockets
New Transaction
Endpoint
Address Wallet API
Multisig Transaction
TRANSACTION
Payment Forwarding
API
Confidence Factor API
Microtransaction API
Asset API
Contract API
Address Balance
Endpoint
ANALYTICS
Analytics Engines and
Parameters
Create/Get Analytics
Job
Anomaly and Fraud
Detection
13. The Problem Statement
How do you create a:
● Useful
● Decentralized
● Resilient
network?
01- What Makes a Blockchain
14. A Useful Network
● Founded on Cryptography to provide concrete, resilient ownership of tokens;
● Monetary policy built into the network
● A powerful scripting language to extend capabilities further:
○ Multisignature
○ Smart Contracts
01- What Makes a Blockchain
23. A Global Transaction Ledger
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
24. A Global Transaction Ledger
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
25. A Global Transaction Ledger
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
ID: Hash (PoW)
Previous Hash
Metadata
26. A “Block Chain”
01- What Makes a Blockchain
ID: Hash (PoW)
Previous Hash
Metadata
ID: Hash (PoW)
Previous Hash
Metadata
Hash (PoW)
vious Hash
tadata
Verifiable, including Proof-of-Work and all
consensus rules, back to genesis
27. Inputs and Outputs
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
Input
Input
Input
Output
Output
ID: Hash
Metadata
Input
Input
Input
Output
Output
ID: Hash
Metadata
28. Inputs and Outputs
Outputs and inputs include script: code executed by all fully validating nodes to
determine spending authority.
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
29. Pseudonymity in Bitcoin
Ownership identifiers are addresses:
● Most commonly owned by a single entity
● Sometimes a representation of not-yet-exposed code (script)
01- What Makes a Blockchain
30. When are Identities not Identities?
1. HD Wallets
01- What Makes a Blockchain
k1
k2
k3
kn
A1
A2
A3
An
31. When are Identities not Identities?
1. HD Wallets
2. Pay to Script Hash (output bound to script):
a. Multisignature
01- What Makes a Blockchain
A1
A2
A3
OP_2 [Pubkey] [Pubkey] [Pubkey] OP_3 OP_CHECKMULTISIG
32. When are Identities not Identities?
1. HD Wallets
2. Pay to Script Hash (output bound to script):
a. Multisignature
01- What Makes a Blockchain
A1
A2
A3
OP_2 [Pubkey] [Pubkey] [Pubkey] OP_3 OP_CHECKMULTISIG
H(x)
Aapparent
33. Current Tools are Lacking
Agents still use manual investigation and traversal
Traditional tools lack the flexibility to evolve alongside bitcoin’s usage patterns and
identity distributions
Behavioral Analytics bridges the gap: focus on patterns of movements, rather than
easily-manipulated properties
02- Review of Previous Work
34. You need a multi-layered solution
02- Review of Previous Work
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
35. Don’t discount the human element
02- Review of Previous Work
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
Shared Spending Authority
Known wallet/service behaviors
1bones2iaA7eijMrCUYYrY5xurwZNyYqU
36. Building a Practical Blockchain Analytics Platform
for Forensics
1. Representing the Graph
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
37. Graphs make natural blockchain representations...
Great for representing broader interactions, relationships, and trends
(plus they look cool)
03- Representing the Graph
38. But not all graphs are created equal.
Transformation to the Address-Address graph creates a loss of resolution and
introduces assumptions
03- Representing the Graph
39. This is especially true with Spark
03- Representing the Graph
Implication: Operations will
vary significantly in both
parallelism and data locality
based on edge versus
vertex property distinctions.
40. Different graphs serve different purposes...
Address-Address
Address-Tx
Output-Output
Input-Output
Transaction-Transaction
03- Representing the Graph
Tx1
Tx2
Tx4
Tx3
41. … and have different properties
03- Representing the Graph
Directed Acyclic Bipartite
Address/Address Y N N
Address/Transaction Y N Y
Input/Output Y Y Y
Output/Output Y Y N
Transaction/Transaction Y Y N
42. Why Spark + GraphX?
Single node graph databases great for performance re: traversal, etc.
But Spark+GraphX gives:
● One infrastructure for graphs, ML models, computation
● Access to graph properties in training, eg. in-degree
● Compute + Storage scalability
03- Representing the Graph
43. 2. Building the ML Pipeline
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
44. 04- Building the ML Pipeline
What you think will be a
problem
A Venn Diagram: Starting a ML Project
What will actually be a
problem
Algorithm Selection
Scaling
Data Quality
Feature Selection
Skynet Training Set
Determination
45. 04- Building the ML Pipeline
A simplifying recommendation
Start with unsupervised methods and iterate:
● Faster ramp-up
● Fewer parameters to tune
● Less overfitting risk
46. 04- Building the ML Pipeline
It begins and ends with data
Data
SourcesData
SourcesData
SourcesData
Sources
Feature Extraction
Selection,
Transformation,
Scaling
Model
Training &
Application
47. Feature extraction is harder than it looks...
Which features to include?
Balance risk of overfitting with need to capture relevant parameters.
04- Building the ML Pipeline
48. … but it all comes down to Var(x)
When considering a variable, ask two questions:
● What variance does this variable introduce to the system?
● Is this variance correlated to the result I want?
04- Building the ML Pipeline
49. Blockchain data is only one part of a larger picture
02- Bitcoin’s Data Set
Raw Blockchain Data
100GB (est.)
6MB/hr (approx.)
Denormalization
Network and Transient
Data
Derived Data
~1TB
50. Big graphs need lots of tuning
● Edge/Vertex distinctions
○ Balance Locality, Parallelism
● Partition Strategies
● Data Denormalization
○ Eg. Reducing number, value of transactions between addresses to properties
03- Representing the Graph
51. Blockchain data has some nuance...
Global consensus follows the chain with the most work:
04- Building the ML Pipeline
52. Blockchain data has some nuance...
Global consensus follows the chain with the most work:
04- Building the ML Pipeline
53. Blockchain data has some nuance...
Global consensus follows the chain with the most work:
04- Building the ML Pipeline
54. … but there are solutions.
A processing delay can address virtually all rewrite cases, but hinders real-time
analytics efforts.
The answer: Lambda Architecture
04- Building the ML Pipeline
55. Lambda Architecture
04- Building the ML Pipeline
Real-time Input
Batching
ML
Model
ML Pipeline Model Update
Feat.
Extract
Model
Application
Data
Transform
Result
56. Why it works: Aggregates are slow to move...
… but what is true in the aggregate cannot predict individual measurements.
04- Building the ML Pipeline
58. Lambda Architecture
05- Tx Clustering Example
Real-time Input
Batching
ML
Model
ML Pipeline Model Update
Feat.
Extract
Model
Application
Data
Transform
Result
59. Focus: Batching Process
05- Tx Clustering Example
Real-time Input
Batching
ML
Model
ML Pipeline Model Update
Feat.
Extract
Model
Application
Data
Transform
Result
60. Delay in processing addresses mutability concerns
05- Tx Clustering Example
Blockchain
Artificial
Delay
Processing
Latency
Trained Model
Coverage
Trained Model
Usage
time
61. Feature extraction from live node data
05- Tx Clustering Example
Input
Input
Input
Output
Output
ID: Hash
Metadata
{
~20 features, including:
● Transaction shape
(inputs, outputs)
● Value distribution
● Input types and ages
62. Cassandra Data store
Data is fed into a kmeans pipeline
07- Tx Clustering Example
Spark
Pull latest
Tx Batch
Tx
Feature
Extractor
Feature
Scaler
Inc.
Update
Model
63. Cassandra Data store
Spark 2.0 ML pipelines give you pipeline persistence
05- Tx Clustering Example
Spark
Pull latest
Tx Batch
Tx
Feature
Extractor
Feature
Scaler
Inc.
Update
Model
64. Cassandra Data store
Streaming pipeline reuses data components
05- Tx Clustering Example
Spark
Feature
Scaler
Tx
Feature
Extractor
Streaming input
Trained
ML Model
65. Cassandra Data store
Trained model used for real-time anomaly detection
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Tx Type &
Anomaly result
66. Cassandra Data store
Also used to color graphs for efficient traversal decisionmaking
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Transaction
Type Result
67. Cassandra Data store
Also used to color graphs for efficient traversal decisionmaking
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Transaction
Type Result
68. Cassandra Data store
Also used to color graphs for efficient traversal decisionmaking
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Transaction
Type Result
69. What to look for in Blockchain Analytics
❏ Infrastructure to ingest and transform metadata
❏ Multi-layered solution
❏ Graphs that address multiple purposes
❏ Resources for tuning graphs
❏ Real-time analytics (e.g. Lambda architecture)
70. Use Cases: Cybercrime
Ransomware $1B in 2016
Approaches
● Reactive - Follow the money
● Proactive - Anomaly detection
Source: FBI, SonicWall
72. Use Cases
Use Case(s) Industry
AML and financial crimes, regulatory compliance Financial Services
Provider/patient demographic data analysis, Claims fraud Healthcare
Mobile payments, Identity management Telecommunications
Supply chain analytics Manufacturing