SlideShare a Scribd company logo
1 of 75
Download to read offline
Spark, GraphX, and blockchains: Building a
behavioral analytics platform for forensics,
fraud, and finance
Agenda
● Why the blockchain and analytics (business)
● Blockchain data complexity
● Existing challenges
● New approach
● What to look for
● Use cases
BlockCypher Background
CRYPTIV
BlockCypher: Blockchain Web Services
Our Customers
Who We Are
BlockCypher Blockchain Web Services
IDENTITY
Data Endpoint
Webhooks Websockets
New Transaction
Endpoint
Address Wallet API
Multisig Transaction
TRANSACTION
Payment Forwarding
API
Confidence Factor API
Microtransaction API
Asset API
Contract API
Address Balance
Endpoint
ANALYTICS
Analytics Engines and
Parameters
Create/Get Analytics
Job
Anomaly and Fraud
Detection
Outsource Infrastructure
BlockCypher Allows You to...
6+
months
less time
35%+
less
costs
Why the blockchain and
blockchain analytics?
Why the Blockchain?
Decentralized
Transactions
Transparency
Security
Why Blockchain Analytics?
Market
Adoption
Regulation
7x+
<2y
Why Blockchain Analytics?
Technology Challenges
● data searching
● high-speed access to
data for analytics
Blockchain Today
What makes a blockchain?
(the 5 minute version)
The Problem Statement
How do you create a:
● Useful
● Decentralized
● Resilient
network?
01- What Makes a Blockchain
A Useful Network
● Founded on Cryptography to provide concrete, resilient ownership of tokens;
● Monetary policy built into the network
● A powerful scripting language to extend capabilities further:
○ Multisignature
○ Smart Contracts
01- What Makes a Blockchain
A Decentralized Network
01- What Makes a Blockchain
A Decentralized Network
01- What Makes a Blockchain
A Decentralized Network
01- What Makes a Blockchain
A Decentralized Network
01- What Makes a Blockchain
A Decentralized Network
01- What Makes a Blockchain
A Decentralized Network
01- What Makes a Blockchain
A Resilient Network
The Byzantine Generals Problem
01- What Makes a Blockchain
Consensus… But of What?
A Global Transaction Ledger
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
A Global Transaction Ledger
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
A Global Transaction Ledger
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
ID: Hash (PoW)
Previous Hash
Metadata
A “Block Chain”
01- What Makes a Blockchain
ID: Hash (PoW)
Previous Hash
Metadata
ID: Hash (PoW)
Previous Hash
Metadata
Hash (PoW)
vious Hash
tadata
Verifiable, including Proof-of-Work and all
consensus rules, back to genesis
Inputs and Outputs
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
Input
Input
Input
Output
Output
ID: Hash
Metadata
Input
Input
Input
Output
Output
ID: Hash
Metadata
Inputs and Outputs
Outputs and inputs include script: code executed by all fully validating nodes to
determine spending authority.
01- What Makes a Blockchain
Input
Input
Input
Output
Output
ID: Hash
Metadata
Pseudonymity in Bitcoin
Ownership identifiers are addresses:
● Most commonly owned by a single entity
● Sometimes a representation of not-yet-exposed code (script)
01- What Makes a Blockchain
When are Identities not Identities?
1. HD Wallets
01- What Makes a Blockchain
k1
k2
k3
kn
A1
A2
A3
An
When are Identities not Identities?
1. HD Wallets
2. Pay to Script Hash (output bound to script):
a. Multisignature
01- What Makes a Blockchain
A1
A2
A3
OP_2 [Pubkey] [Pubkey] [Pubkey] OP_3 OP_CHECKMULTISIG
When are Identities not Identities?
1. HD Wallets
2. Pay to Script Hash (output bound to script):
a. Multisignature
01- What Makes a Blockchain
A1
A2
A3
OP_2 [Pubkey] [Pubkey] [Pubkey] OP_3 OP_CHECKMULTISIG
H(x)
Aapparent
Current Tools are Lacking
Agents still use manual investigation and traversal
Traditional tools lack the flexibility to evolve alongside bitcoin’s usage patterns and
identity distributions
Behavioral Analytics bridges the gap: focus on patterns of movements, rather than
easily-manipulated properties
02- Review of Previous Work
You need a multi-layered solution
02- Review of Previous Work
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
Don’t discount the human element
02- Review of Previous Work
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
Shared Spending Authority
Known wallet/service behaviors
1bones2iaA7eijMrCUYYrY5xurwZNyYqU
Building a Practical Blockchain Analytics Platform
for Forensics
1. Representing the Graph
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
Graphs make natural blockchain representations...
Great for representing broader interactions, relationships, and trends
(plus they look cool)
03- Representing the Graph
But not all graphs are created equal.
Transformation to the Address-Address graph creates a loss of resolution and
introduces assumptions
03- Representing the Graph
This is especially true with Spark
03- Representing the Graph
Implication: Operations will
vary significantly in both
parallelism and data locality
based on edge versus
vertex property distinctions.
Different graphs serve different purposes...
Address-Address
Address-Tx
Output-Output
Input-Output
Transaction-Transaction
03- Representing the Graph
Tx1
Tx2
Tx4
Tx3
… and have different properties
03- Representing the Graph
Directed Acyclic Bipartite
Address/Address Y N N
Address/Transaction Y N Y
Input/Output Y Y Y
Output/Output Y Y N
Transaction/Transaction Y Y N
Why Spark + GraphX?
Single node graph databases great for performance re: traversal, etc.
But Spark+GraphX gives:
● One infrastructure for graphs, ML models, computation
● Access to graph properties in training, eg. in-degree
● Compute + Storage scalability
03- Representing the Graph
2. Building the ML Pipeline
Structured
Queries
Graph
Traversal
Hand-tuned
Heuristics
Machine
Learning
04- Building the ML Pipeline
What you think will be a
problem
A Venn Diagram: Starting a ML Project
What will actually be a
problem
Algorithm Selection
Scaling
Data Quality
Feature Selection
Skynet Training Set
Determination
04- Building the ML Pipeline
A simplifying recommendation
Start with unsupervised methods and iterate:
● Faster ramp-up
● Fewer parameters to tune
● Less overfitting risk
04- Building the ML Pipeline
It begins and ends with data
Data
SourcesData
SourcesData
SourcesData
Sources
Feature Extraction
Selection,
Transformation,
Scaling
Model
Training &
Application
Feature extraction is harder than it looks...
Which features to include?
Balance risk of overfitting with need to capture relevant parameters.
04- Building the ML Pipeline
… but it all comes down to Var(x)
When considering a variable, ask two questions:
● What variance does this variable introduce to the system?
● Is this variance correlated to the result I want?
04- Building the ML Pipeline
Blockchain data is only one part of a larger picture
02- Bitcoin’s Data Set
Raw Blockchain Data
100GB (est.)
6MB/hr (approx.)
Denormalization
Network and Transient
Data
Derived Data
~1TB
Big graphs need lots of tuning
● Edge/Vertex distinctions
○ Balance Locality, Parallelism
● Partition Strategies
● Data Denormalization
○ Eg. Reducing number, value of transactions between addresses to properties
03- Representing the Graph
Blockchain data has some nuance...
Global consensus follows the chain with the most work:
04- Building the ML Pipeline
Blockchain data has some nuance...
Global consensus follows the chain with the most work:
04- Building the ML Pipeline
Blockchain data has some nuance...
Global consensus follows the chain with the most work:
04- Building the ML Pipeline
… but there are solutions.
A processing delay can address virtually all rewrite cases, but hinders real-time
analytics efforts.
The answer: Lambda Architecture
04- Building the ML Pipeline
Lambda Architecture
04- Building the ML Pipeline
Real-time Input
Batching
ML
Model
ML Pipeline Model Update
Feat.
Extract
Model
Application
Data
Transform
Result
Why it works: Aggregates are slow to move...
… but what is true in the aggregate cannot predict individual measurements.
04- Building the ML Pipeline
Example: Tx Clustering and Anomaly
Detection
Lambda Architecture
05- Tx Clustering Example
Real-time Input
Batching
ML
Model
ML Pipeline Model Update
Feat.
Extract
Model
Application
Data
Transform
Result
Focus: Batching Process
05- Tx Clustering Example
Real-time Input
Batching
ML
Model
ML Pipeline Model Update
Feat.
Extract
Model
Application
Data
Transform
Result
Delay in processing addresses mutability concerns
05- Tx Clustering Example
Blockchain
Artificial
Delay
Processing
Latency
Trained Model
Coverage
Trained Model
Usage
time
Feature extraction from live node data
05- Tx Clustering Example
Input
Input
Input
Output
Output
ID: Hash
Metadata
{
~20 features, including:
● Transaction shape
(inputs, outputs)
● Value distribution
● Input types and ages
Cassandra Data store
Data is fed into a kmeans pipeline
07- Tx Clustering Example
Spark
Pull latest
Tx Batch
Tx
Feature
Extractor
Feature
Scaler
Inc.
Update
Model
Cassandra Data store
Spark 2.0 ML pipelines give you pipeline persistence
05- Tx Clustering Example
Spark
Pull latest
Tx Batch
Tx
Feature
Extractor
Feature
Scaler
Inc.
Update
Model
Cassandra Data store
Streaming pipeline reuses data components
05- Tx Clustering Example
Spark
Feature
Scaler
Tx
Feature
Extractor
Streaming input
Trained
ML Model
Cassandra Data store
Trained model used for real-time anomaly detection
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Tx Type &
Anomaly result
Cassandra Data store
Also used to color graphs for efficient traversal decisionmaking
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Transaction
Type Result
Cassandra Data store
Also used to color graphs for efficient traversal decisionmaking
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Transaction
Type Result
Cassandra Data store
Also used to color graphs for efficient traversal decisionmaking
05- Tx Clustering Example
Feature
Scaler
Tx
Feature
Extractor
Trained
ML Model
Streaming input
Transaction
Type Result
What to look for in Blockchain Analytics
❏ Infrastructure to ingest and transform metadata
❏ Multi-layered solution
❏ Graphs that address multiple purposes
❏ Resources for tuning graphs
❏ Real-time analytics (e.g. Lambda architecture)
Use Cases: Cybercrime
Ransomware $1B in 2016
Approaches
● Reactive - Follow the money
● Proactive - Anomaly detection
Source: FBI, SonicWall
Demo
Use Cases
Use Case(s) Industry
AML and financial crimes, regulatory compliance Financial Services
Provider/patient demographic data analysis, Claims fraud Healthcare
Mobile payments, Identity management Telecommunications
Supply chain analytics Manufacturing
www.blockcypher.com
@blockcypher
Data Transformation Machine Learning Modeling
Private Chain
...

More Related Content

What's hot

Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Spark Summit
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Flink Forward
 

What's hot (20)

Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog Food
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer Nori
 
Accelerating Machine Learning on Databricks Runtime
Accelerating Machine Learning on Databricks RuntimeAccelerating Machine Learning on Databricks Runtime
Accelerating Machine Learning on Databricks Runtime
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 

Viewers also liked

Philosophy of Deep Learning
Philosophy of Deep LearningPhilosophy of Deep Learning
Philosophy of Deep Learning
Melanie Swan
 
Future of AI: Blockchain and Deep Learning
Future of AI: Blockchain and Deep LearningFuture of AI: Blockchain and Deep Learning
Future of AI: Blockchain and Deep Learning
Melanie Swan
 

Viewers also liked (20)

Blockchain Cloudminds: Human-Machine Pooled-Mind DACs
Blockchain Cloudminds: Human-Machine Pooled-Mind DACsBlockchain Cloudminds: Human-Machine Pooled-Mind DACs
Blockchain Cloudminds: Human-Machine Pooled-Mind DACs
 
陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰
 
Philosophy of Deep Learning
Philosophy of Deep LearningPhilosophy of Deep Learning
Philosophy of Deep Learning
 
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
 
Blockchain Economic Theory
Blockchain Economic TheoryBlockchain Economic Theory
Blockchain Economic Theory
 
Blockchain Payment Channels Explained
Blockchain Payment Channels ExplainedBlockchain Payment Channels Explained
Blockchain Payment Channels Explained
 
Artificial Intelligence & Blockchain Synergy
Artificial Intelligence & Blockchain SynergyArtificial Intelligence & Blockchain Synergy
Artificial Intelligence & Blockchain Synergy
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies
 
[系列活動] 一天搞懂對話機器人
[系列活動] 一天搞懂對話機器人[系列活動] 一天搞懂對話機器人
[系列活動] 一天搞懂對話機器人
 
[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務
 
Blockchain Consensus Protocols
Blockchain Consensus ProtocolsBlockchain Consensus Protocols
Blockchain Consensus Protocols
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法
 
Future of AI: Blockchain and Deep Learning
Future of AI: Blockchain and Deep LearningFuture of AI: Blockchain and Deep Learning
Future of AI: Blockchain and Deep Learning
 
[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用
[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用
[DSC x TAAI 2016] 林守德 / 人工智慧與機器學習在推薦系統上的應用
 
[系列活動] Python爬蟲實戰
[系列活動] Python爬蟲實戰[系列活動] Python爬蟲實戰
[系列活動] Python爬蟲實戰
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
[系列活動] 使用 R 語言建立自己的演算法交易事業
[系列活動] 使用 R 語言建立自己的演算法交易事業[系列活動] 使用 R 語言建立自己的演算法交易事業
[系列活動] 使用 R 語言建立自己的演算法交易事業
 
[系列活動] 一日搞懂生成式對抗網路
[系列活動] 一日搞懂生成式對抗網路[系列活動] 一日搞懂生成式對抗網路
[系列活動] 一日搞懂生成式對抗網路
 
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)
 
給軟體工程師的不廢話 R 語言精要班
給軟體工程師的不廢話 R 語言精要班給軟體工程師的不廢話 R 語言精要班
給軟體工程師的不廢話 R 語言精要班
 

Similar to Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for Forensics, Fraud, and Finance with Karen Hsu

Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...
Michael Noll
 

Similar to Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for Forensics, Fraud, and Finance with Karen Hsu (20)

A Practical Deep Dive into Observability of Streaming Applications with Kosta...
A Practical Deep Dive into Observability of Streaming Applications with Kosta...A Practical Deep Dive into Observability of Streaming Applications with Kosta...
A Practical Deep Dive into Observability of Streaming Applications with Kosta...
 
New Business Models enabled by Blockchain
New Business Models enabled by BlockchainNew Business Models enabled by Blockchain
New Business Models enabled by Blockchain
 
MLOps journey at Swisscom: AI Use Cases, Architecture and Future Vision
MLOps journey at Swisscom: AI Use Cases, Architecture and Future VisionMLOps journey at Swisscom: AI Use Cases, Architecture and Future Vision
MLOps journey at Swisscom: AI Use Cases, Architecture and Future Vision
 
2017 Microservices Practitioner Virtual Summit: Microservices at Squarespace ...
2017 Microservices Practitioner Virtual Summit: Microservices at Squarespace ...2017 Microservices Practitioner Virtual Summit: Microservices at Squarespace ...
2017 Microservices Practitioner Virtual Summit: Microservices at Squarespace ...
 
How to mutate your immutable log | Andrey Falko, Stripe
How to mutate your immutable log | Andrey Falko, StripeHow to mutate your immutable log | Andrey Falko, Stripe
How to mutate your immutable log | Andrey Falko, Stripe
 
Stream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream SharingStream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream Sharing
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
 
Real Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With SparkReal Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With Spark
 
IBM Blockchain Platform - Architectural Good Practices v1.0
IBM Blockchain Platform - Architectural Good Practices v1.0IBM Blockchain Platform - Architectural Good Practices v1.0
IBM Blockchain Platform - Architectural Good Practices v1.0
 
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionFrom BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.
 
J.burke HackMiami6
J.burke HackMiami6J.burke HackMiami6
J.burke HackMiami6
 
New Approaches for Fraud Detection on Apache Kafka and KSQL
New Approaches for Fraud Detection on Apache Kafka and KSQLNew Approaches for Fraud Detection on Apache Kafka and KSQL
New Approaches for Fraud Detection on Apache Kafka and KSQL
 
IBM presents: Hyperledger Fabric Hands On Workshop - part 1
IBM presents: Hyperledger Fabric Hands On Workshop - part 1IBM presents: Hyperledger Fabric Hands On Workshop - part 1
IBM presents: Hyperledger Fabric Hands On Workshop - part 1
 
Microservices at ibotta pitfalls and learnings
Microservices at ibotta pitfalls and learningsMicroservices at ibotta pitfalls and learnings
Microservices at ibotta pitfalls and learnings
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
 
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa...
 
Using bluemix predictive analytics service in Node-RED
Using bluemix predictive analytics service in Node-REDUsing bluemix predictive analytics service in Node-RED
Using bluemix predictive analytics service in Node-RED
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 

Recently uploaded (20)

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 

Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for Forensics, Fraud, and Finance with Karen Hsu

  • 1. Spark, GraphX, and blockchains: Building a behavioral analytics platform for forensics, fraud, and finance
  • 2. Agenda ● Why the blockchain and analytics (business) ● Blockchain data complexity ● Existing challenges ● New approach ● What to look for ● Use cases
  • 5. BlockCypher: Blockchain Web Services Our Customers Who We Are
  • 6. BlockCypher Blockchain Web Services IDENTITY Data Endpoint Webhooks Websockets New Transaction Endpoint Address Wallet API Multisig Transaction TRANSACTION Payment Forwarding API Confidence Factor API Microtransaction API Asset API Contract API Address Balance Endpoint ANALYTICS Analytics Engines and Parameters Create/Get Analytics Job Anomaly and Fraud Detection
  • 7. Outsource Infrastructure BlockCypher Allows You to... 6+ months less time 35%+ less costs
  • 8. Why the blockchain and blockchain analytics?
  • 11. Why Blockchain Analytics? Technology Challenges ● data searching ● high-speed access to data for analytics Blockchain Today
  • 12. What makes a blockchain? (the 5 minute version)
  • 13. The Problem Statement How do you create a: ● Useful ● Decentralized ● Resilient network? 01- What Makes a Blockchain
  • 14. A Useful Network ● Founded on Cryptography to provide concrete, resilient ownership of tokens; ● Monetary policy built into the network ● A powerful scripting language to extend capabilities further: ○ Multisignature ○ Smart Contracts 01- What Makes a Blockchain
  • 15. A Decentralized Network 01- What Makes a Blockchain
  • 16. A Decentralized Network 01- What Makes a Blockchain
  • 17. A Decentralized Network 01- What Makes a Blockchain
  • 18. A Decentralized Network 01- What Makes a Blockchain
  • 19. A Decentralized Network 01- What Makes a Blockchain
  • 20. A Decentralized Network 01- What Makes a Blockchain
  • 21. A Resilient Network The Byzantine Generals Problem 01- What Makes a Blockchain
  • 23. A Global Transaction Ledger 01- What Makes a Blockchain Input Input Input Output Output ID: Hash Metadata
  • 24. A Global Transaction Ledger 01- What Makes a Blockchain Input Input Input Output Output ID: Hash Metadata
  • 25. A Global Transaction Ledger 01- What Makes a Blockchain Input Input Input Output Output ID: Hash Metadata ID: Hash (PoW) Previous Hash Metadata
  • 26. A “Block Chain” 01- What Makes a Blockchain ID: Hash (PoW) Previous Hash Metadata ID: Hash (PoW) Previous Hash Metadata Hash (PoW) vious Hash tadata Verifiable, including Proof-of-Work and all consensus rules, back to genesis
  • 27. Inputs and Outputs 01- What Makes a Blockchain Input Input Input Output Output ID: Hash Metadata Input Input Input Output Output ID: Hash Metadata Input Input Input Output Output ID: Hash Metadata
  • 28. Inputs and Outputs Outputs and inputs include script: code executed by all fully validating nodes to determine spending authority. 01- What Makes a Blockchain Input Input Input Output Output ID: Hash Metadata
  • 29. Pseudonymity in Bitcoin Ownership identifiers are addresses: ● Most commonly owned by a single entity ● Sometimes a representation of not-yet-exposed code (script) 01- What Makes a Blockchain
  • 30. When are Identities not Identities? 1. HD Wallets 01- What Makes a Blockchain k1 k2 k3 kn A1 A2 A3 An
  • 31. When are Identities not Identities? 1. HD Wallets 2. Pay to Script Hash (output bound to script): a. Multisignature 01- What Makes a Blockchain A1 A2 A3 OP_2 [Pubkey] [Pubkey] [Pubkey] OP_3 OP_CHECKMULTISIG
  • 32. When are Identities not Identities? 1. HD Wallets 2. Pay to Script Hash (output bound to script): a. Multisignature 01- What Makes a Blockchain A1 A2 A3 OP_2 [Pubkey] [Pubkey] [Pubkey] OP_3 OP_CHECKMULTISIG H(x) Aapparent
  • 33. Current Tools are Lacking Agents still use manual investigation and traversal Traditional tools lack the flexibility to evolve alongside bitcoin’s usage patterns and identity distributions Behavioral Analytics bridges the gap: focus on patterns of movements, rather than easily-manipulated properties 02- Review of Previous Work
  • 34. You need a multi-layered solution 02- Review of Previous Work Structured Queries Graph Traversal Hand-tuned Heuristics Machine Learning
  • 35. Don’t discount the human element 02- Review of Previous Work Structured Queries Graph Traversal Hand-tuned Heuristics Machine Learning Shared Spending Authority Known wallet/service behaviors 1bones2iaA7eijMrCUYYrY5xurwZNyYqU
  • 36. Building a Practical Blockchain Analytics Platform for Forensics 1. Representing the Graph Structured Queries Graph Traversal Hand-tuned Heuristics Machine Learning
  • 37. Graphs make natural blockchain representations... Great for representing broader interactions, relationships, and trends (plus they look cool) 03- Representing the Graph
  • 38. But not all graphs are created equal. Transformation to the Address-Address graph creates a loss of resolution and introduces assumptions 03- Representing the Graph
  • 39. This is especially true with Spark 03- Representing the Graph Implication: Operations will vary significantly in both parallelism and data locality based on edge versus vertex property distinctions.
  • 40. Different graphs serve different purposes... Address-Address Address-Tx Output-Output Input-Output Transaction-Transaction 03- Representing the Graph Tx1 Tx2 Tx4 Tx3
  • 41. … and have different properties 03- Representing the Graph Directed Acyclic Bipartite Address/Address Y N N Address/Transaction Y N Y Input/Output Y Y Y Output/Output Y Y N Transaction/Transaction Y Y N
  • 42. Why Spark + GraphX? Single node graph databases great for performance re: traversal, etc. But Spark+GraphX gives: ● One infrastructure for graphs, ML models, computation ● Access to graph properties in training, eg. in-degree ● Compute + Storage scalability 03- Representing the Graph
  • 43. 2. Building the ML Pipeline Structured Queries Graph Traversal Hand-tuned Heuristics Machine Learning
  • 44. 04- Building the ML Pipeline What you think will be a problem A Venn Diagram: Starting a ML Project What will actually be a problem Algorithm Selection Scaling Data Quality Feature Selection Skynet Training Set Determination
  • 45. 04- Building the ML Pipeline A simplifying recommendation Start with unsupervised methods and iterate: ● Faster ramp-up ● Fewer parameters to tune ● Less overfitting risk
  • 46. 04- Building the ML Pipeline It begins and ends with data Data SourcesData SourcesData SourcesData Sources Feature Extraction Selection, Transformation, Scaling Model Training & Application
  • 47. Feature extraction is harder than it looks... Which features to include? Balance risk of overfitting with need to capture relevant parameters. 04- Building the ML Pipeline
  • 48. … but it all comes down to Var(x) When considering a variable, ask two questions: ● What variance does this variable introduce to the system? ● Is this variance correlated to the result I want? 04- Building the ML Pipeline
  • 49. Blockchain data is only one part of a larger picture 02- Bitcoin’s Data Set Raw Blockchain Data 100GB (est.) 6MB/hr (approx.) Denormalization Network and Transient Data Derived Data ~1TB
  • 50. Big graphs need lots of tuning ● Edge/Vertex distinctions ○ Balance Locality, Parallelism ● Partition Strategies ● Data Denormalization ○ Eg. Reducing number, value of transactions between addresses to properties 03- Representing the Graph
  • 51. Blockchain data has some nuance... Global consensus follows the chain with the most work: 04- Building the ML Pipeline
  • 52. Blockchain data has some nuance... Global consensus follows the chain with the most work: 04- Building the ML Pipeline
  • 53. Blockchain data has some nuance... Global consensus follows the chain with the most work: 04- Building the ML Pipeline
  • 54. … but there are solutions. A processing delay can address virtually all rewrite cases, but hinders real-time analytics efforts. The answer: Lambda Architecture 04- Building the ML Pipeline
  • 55. Lambda Architecture 04- Building the ML Pipeline Real-time Input Batching ML Model ML Pipeline Model Update Feat. Extract Model Application Data Transform Result
  • 56. Why it works: Aggregates are slow to move... … but what is true in the aggregate cannot predict individual measurements. 04- Building the ML Pipeline
  • 57. Example: Tx Clustering and Anomaly Detection
  • 58. Lambda Architecture 05- Tx Clustering Example Real-time Input Batching ML Model ML Pipeline Model Update Feat. Extract Model Application Data Transform Result
  • 59. Focus: Batching Process 05- Tx Clustering Example Real-time Input Batching ML Model ML Pipeline Model Update Feat. Extract Model Application Data Transform Result
  • 60. Delay in processing addresses mutability concerns 05- Tx Clustering Example Blockchain Artificial Delay Processing Latency Trained Model Coverage Trained Model Usage time
  • 61. Feature extraction from live node data 05- Tx Clustering Example Input Input Input Output Output ID: Hash Metadata { ~20 features, including: ● Transaction shape (inputs, outputs) ● Value distribution ● Input types and ages
  • 62. Cassandra Data store Data is fed into a kmeans pipeline 07- Tx Clustering Example Spark Pull latest Tx Batch Tx Feature Extractor Feature Scaler Inc. Update Model
  • 63. Cassandra Data store Spark 2.0 ML pipelines give you pipeline persistence 05- Tx Clustering Example Spark Pull latest Tx Batch Tx Feature Extractor Feature Scaler Inc. Update Model
  • 64. Cassandra Data store Streaming pipeline reuses data components 05- Tx Clustering Example Spark Feature Scaler Tx Feature Extractor Streaming input Trained ML Model
  • 65. Cassandra Data store Trained model used for real-time anomaly detection 05- Tx Clustering Example Feature Scaler Tx Feature Extractor Trained ML Model Streaming input Tx Type & Anomaly result
  • 66. Cassandra Data store Also used to color graphs for efficient traversal decisionmaking 05- Tx Clustering Example Feature Scaler Tx Feature Extractor Trained ML Model Streaming input Transaction Type Result
  • 67. Cassandra Data store Also used to color graphs for efficient traversal decisionmaking 05- Tx Clustering Example Feature Scaler Tx Feature Extractor Trained ML Model Streaming input Transaction Type Result
  • 68. Cassandra Data store Also used to color graphs for efficient traversal decisionmaking 05- Tx Clustering Example Feature Scaler Tx Feature Extractor Trained ML Model Streaming input Transaction Type Result
  • 69. What to look for in Blockchain Analytics ❏ Infrastructure to ingest and transform metadata ❏ Multi-layered solution ❏ Graphs that address multiple purposes ❏ Resources for tuning graphs ❏ Real-time analytics (e.g. Lambda architecture)
  • 70. Use Cases: Cybercrime Ransomware $1B in 2016 Approaches ● Reactive - Follow the money ● Proactive - Anomaly detection Source: FBI, SonicWall
  • 71. Demo
  • 72. Use Cases Use Case(s) Industry AML and financial crimes, regulatory compliance Financial Services Provider/patient demographic data analysis, Claims fraud Healthcare Mobile payments, Identity management Telecommunications Supply chain analytics Manufacturing
  • 74.
  • 75. Data Transformation Machine Learning Modeling Private Chain ...