Real Time Fraud Detection With
Sequence Mining on Big Data
Platform
Pranab Ghosh
Big Data Consultant
IEEE CNSV meeting, May 6 2014
Santa Clara, CA
Open Source Big Data
Ecosystem
 Query (NoSQL): Cassandra, HBase, MongoDB and
more
 Query (SQL): Hive, Stinger, Impala, Presto, Shark
 Aggregation analytics (NoSQL): Druid, MongoDB
 Aggregation analytics (SQL): Hive, Stinger, Impala,
Presto, Shark
 Advanced analytics: Hadoop, Spark
 Real time: Storm, Samza, S4, Spark Streaming
Hadoop
 Hadoop combines the power of functional programming
and parallel processing
 Parallel processing framework running on a cluster of
commodity machines
 Stateless functional programming, because processing
each row of data does not depend on any other
row or on any state
 Divide and conquer: data gets partitioned and the
partitions get processed in parallel
Storm
 Clustered framework for scalable real time stream
processing
 Like Hadoop, it is a parallel processing framework that
runs on a cluster of commodity machines
 Uses a combination of processes and threads for
parallelism, unlike Hadoop, which uses only processes
 Unlike the two processing stages in Hadoop (map and
reduce), a Storm topology can define multiple
processing stages
Redis
 It’s a wonderful glue for the Big Data ecosystem
 Many people think of it as a distributed data structure
server
 Can be used as a list, queue, cache etc.
 Supports master-slave replication
 No built-in sharding support at the time of this talk
(Redis Cluster, which shards data across nodes, arrived
later in Redis 3.0)
Fraud Detection Basics
 Belongs to the general category of problems known as
Outlier Detection, i.e., detecting data points that don’t follow
the trends and patterns in the data
 Two approaches for treating input: 1. focus on an individual
data point 2. focus on a sequence of data points
 Two kinds of algorithms: 1. building a model out of the data 2.
using the data directly
 Real time fraud detection is only feasible with a model-based
approach. A model is built by batch processing of training
data. A real time stream processor uses the model and
makes predictions in real time
Credit Card Fraud Detection
 We will use a sequence-based approach. We will use
Hadoop to build a Markov Chain model
 Storm will process transaction data in real time and will
use the Markov Chain model to predict potential fraud
in the incoming transaction stream
 Redis is used for the incoming transaction queue. It’s
also used as a key-value store for the Markov model.
Fraudulent transaction sequences are written to
another Redis queue
Why Sequence Based
Algorithm
 Sequence-based algorithms are generally more
powerful. Certain fraudulent activities may not be
detectable with instance-based algorithms
 If your bank account is hacked and there are many
transactions, each withdrawing a small amount of
money, instance-based algorithms will fail to detect the
fraud, because no single transaction looks unusual
 However, a sequence-based algorithm is more likely to
detect the fraudulent activity
Architecture
Building Markov Chain Model
with Hadoop
 A transaction consists of the triplet (amount spent,
whether it includes a high-price ticket item, time elapsed
since the last transaction)
 Each item in the triplet is categorical and has 2 or 3
possible values. We end up with 18 possible transaction
types.
 Our goal is to build an 18 x 18 state transition probability
matrix
 Input data consists of customer ID, transaction ID and
transaction type.
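As a rough illustration of how the triplet maps to 18 transaction types, the sketch below enumerates every combination. The category labels and the 3 x 2 x 3 split are assumptions chosen to reach 18; the slides do not list the actual values, and the names `STATES`, `STATE_INDEX` and `encode` are hypothetical.

```python
from itertools import product

# Hypothetical category values; the slides only say each item has 2 or 3 values.
AMOUNT = ["L", "N", "H"]        # amount spent: low, normal, high
HIGH_PRICE_ITEM = ["N", "H"]    # includes a high-price item: no, yes
ELAPSED = ["L", "N", "S"]       # time since last transaction: long, normal, short

# Every combination of the three categorical items is one transaction type (state).
STATES = ["".join(t) for t in product(AMOUNT, HIGH_PRICE_ITEM, ELAPSED)]
STATE_INDEX = {s: i for i, s in enumerate(STATES)}


def encode(amount, high_price_item, elapsed):
    """Return the transaction type label for one categorized transaction."""
    return amount + high_price_item + elapsed
```

Each label doubles as a row and column key of the 18 x 18 matrix through `STATE_INDEX`.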
Building Markov Chain Model
with Hadoop
 The data is processed through a Map Reduce job to
group by customer ID, so that we can have the
transaction sequence for each customer.
 The next Map Reduce job counts the different state
transitions and builds the 18 x 18 state transition
probability matrix
 We are using a first order Markov model, i.e., the
probability of a state depends only on the immediately
preceding state.
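The two jobs can be sketched in plain Python to show the logic. This is an in-memory stand-in for the Map Reduce jobs, not the Hadoop implementation; the function names are hypothetical, and records are assumed to arrive in transaction order for each customer.

```python
from collections import defaultdict


def group_by_customer(records):
    """Stage 1: turn (customer_id, txn_id, txn_type) records into per-customer sequences."""
    sequences = defaultdict(list)
    for customer_id, _txn_id, txn_type in records:
        sequences[customer_id].append(txn_type)
    return sequences


def transition_matrix(sequences, states):
    """Stage 2: count state transitions and normalize each row into probabilities."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for seq in sequences.values():
        # Each consecutive pair in a customer's sequence is one observed transition.
        for prev, cur in zip(seq, seq[1:]):
            counts[index[prev]][index[cur]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all zero in this sketch.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

With the 18 transaction types as `states`, the result is the 18 x 18 matrix; a real job would also need a policy for rows that were never observed.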
Real Time Prediction with
Storm
 Our Storm topology (i.e., job) is very simple, with one spout
for ingesting the transaction stream from a Redis queue. There
is one bolt that does all the processing.
 The incoming transaction stream is field grouped (i.e.,
partitioned) by customer ID, so that each bolt instance
processes data for a subset of customers.
 Each bolt maintains a window of preconfigured size for each
customer, where it collects the most recent n transactions
 Every time the window gets updated, the bolt calculates
a metric indicating whether the current transaction
sequence is fraudulent
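The partitioning and windowing described above can be sketched as follows. This is plain Python rather than Storm code: `bolt_for` only imitates field grouping with a stable hash, and `WindowingBolt` and `WINDOW_SIZE` are hypothetical names and values.

```python
from collections import defaultdict, deque
import zlib

WINDOW_SIZE = 5  # hypothetical; the slides only say the size is preconfigured


def bolt_for(customer_id, num_bolts):
    """Imitate field grouping: the same customer ID always maps to the same bolt instance."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_bolts


class WindowingBolt:
    """Keeps the most recent transaction types for each customer it is responsible for."""

    def __init__(self, window_size=WINDOW_SIZE):
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def process(self, customer_id, txn_type):
        window = self.windows[customer_id]
        window.append(txn_type)  # the oldest entry drops out once the window is full
        return list(window)      # current sequence, ready for the outlier metric
```

Because all transactions for a customer land on one instance, the per-customer window can live in that instance's memory without coordination.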
Real Time Prediction with
Storm
 There are various outlier metrics for sequences. The one
we are using is called the Miss Probability Metric
 For a transaction state pair in the sequence, it sums all the
state transition probabilities from the first state, except the one
whose target state is the second state of the pair. The process
is repeated for all pairs in the sequence and all the sums are
added together.
 If the final sum exceeds some predefined threshold, the
transaction sequence is deemed fraudulent. The transaction
sequence is written to a Redis queue
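A minimal sketch of the metric as described above, assuming the transition matrix is a list of rows and `index` maps a state label to its row and column position (both representations are assumptions, as are the function names):

```python
def miss_probability(sequence, matrix, index):
    """Sum, over consecutive state pairs, the transition probabilities to every
    target state other than the one actually observed."""
    total = 0.0
    for prev, cur in zip(sequence, sequence[1:]):
        row = matrix[index[prev]]
        observed = index[cur]
        total += sum(p for k, p in enumerate(row) if k != observed)
    return total


def is_fraudulent(sequence, matrix, index, threshold):
    """Flag the window when the metric exceeds the predefined threshold."""
    return miss_probability(sequence, matrix, index) > threshold
```

For example, with two states where A to A has probability 0.9, the sequence A, A, A scores 0.1 + 0.1, while a sequence full of rare transitions scores much higher.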
Sequence Outlier Metric
 The metric we have used reflects the probability of the transaction
sequence in the window. The higher the metric, the lower the
probability of the transaction sequence.
 The lower the probability of the transaction sequence, the higher the
likelihood of the transaction sequence being an outlier, i.e.,
fraudulent.
 The size of the window holding the current transactions is an
important factor. The smaller the window, the more locally sensitive
the result
 Proper choice of the metric threshold is critical. A smaller value will
cause more false positives and a larger value will cause more
false negatives. Since false negatives are costlier, it’s more
conservative to choose a smaller threshold value
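One way to write the metric down, assuming each row of the transition matrix $P$ sums to 1 and the window holds the states $s_1, \dots, s_n$ (this notation is introduced here and is not in the slides):

```latex
M = \sum_{i=1}^{n-1} \sum_{k \neq s_{i+1}} P(s_i, k)
  = \sum_{i=1}^{n-1} \bigl( 1 - P(s_i, s_{i+1}) \bigr)
```

Under that assumption, each unlikely transition adds close to 1 to $M$ and each likely one adds close to 0, which is why a higher metric corresponds to a less probable sequence. $M$ can reach at most $n - 1$, so a threshold is only meaningful relative to the chosen window size.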
Some Storm Limitations
 Storm guarantees at-least-once message processing
semantics, so some messages may be processed twice.
The impact of this is not too serious in this case.
 Storm does not provide any state management
features. We were doing stateful processing inside the
Storm bolt, with recent transactions stored in an
in-memory buffer.
 The impact of this could be serious in our case. If a Storm
node goes down while its buffer holds potentially
fraudulent transactions, the in-memory window is lost.
Summing Up
 We have shown how different components of the Big
Data ecosystem can be orchestrated to solve the real
time fraud detection problem
 We have used Hadoop and Storm effectively as a
solution platform. However, if I were to start the project
today, I would seriously consider Spark and Spark
Streaming.
Resources
 For more details, please visit my blog at
http://pkghosh.wordpress.com/2013/10/21/real-time-fraud-detection-with-sequence-mining/
 The implementation is part of my OSS project on GitHub
at https://github.com/pranab/beymani
 A survey of different outlier detection algorithms is
available in this blog post of mine:
http://pkghosh.wordpress.com/2012/01/02/fraudsters-outliers-and-big-data-2/

More Related Content

Viewers also liked

Medicare fraud detection
Medicare fraud detection Medicare fraud detection
Medicare fraud detection
Xinyu (Max) Liu
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
Contexti
 
Healthcare fraud detection
Healthcare fraud detectionHealthcare fraud detection
Healthcare fraud detection
Mahdi Esmailoghli
 
Fraud Detection Using A Database Platform
Fraud Detection Using A Database PlatformFraud Detection Using A Database Platform
Fraud Detection Using A Database Platform
EZ-R Stats, LLC
 
How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?
Linkurious
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using Hadoop
Pranab Ghosh
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detection
Mk Kim
 
Big Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud DetectionBig Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud Detection
DataWorks Summit/Hadoop Summit
 
Fraud in the Banking Sector
Fraud in the Banking Sector Fraud in the Banking Sector
Fraud in the Banking Sector
Venktesh Venke
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and Mitigation
Scott Mongeau
 
Hadoop BIG Data - Fraud Detection with Real-Time Analytics
Hadoop BIG Data - Fraud Detection with Real-Time AnalyticsHadoop BIG Data - Fraud Detection with Real-Time Analytics
Hadoop BIG Data - Fraud Detection with Real-Time Analytics
hkbhadraa
 
Fraud detection
Fraud detectionFraud detection
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
kalpesh1908
 
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
Seeling Cheung
 
Big Fish Games: Democratizing Data Access
Big Fish Games: Democratizing Data AccessBig Fish Games: Democratizing Data Access
Big Fish Games: Democratizing Data Access
Seeling Cheung
 
Medical University of South Carolina: Using Big Data and Predictive Analytics...
Medical University of South Carolina: Using Big Data and Predictive Analytics...Medical University of South Carolina: Using Big Data and Predictive Analytics...
Medical University of South Carolina: Using Big Data and Predictive Analytics...
Seeling Cheung
 
Southwest Power Pool big data case study
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study
Seeling Cheung
 
Constant Contact: An Online Marketing Leader’s Data Lake Journey
Constant Contact: An Online Marketing Leader’s Data Lake JourneyConstant Contact: An Online Marketing Leader’s Data Lake Journey
Constant Contact: An Online Marketing Leader’s Data Lake Journey
Seeling Cheung
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seeling Cheung
 
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Seeling Cheung
 

Viewers also liked (20)

Medicare fraud detection
Medicare fraud detection Medicare fraud detection
Medicare fraud detection
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
 
Healthcare fraud detection
Healthcare fraud detectionHealthcare fraud detection
Healthcare fraud detection
 
Fraud Detection Using A Database Platform
Fraud Detection Using A Database PlatformFraud Detection Using A Database Platform
Fraud Detection Using A Database Platform
 
How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using Hadoop
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detection
 
Big Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud DetectionBig Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud Detection
 
Fraud in the Banking Sector
Fraud in the Banking Sector Fraud in the Banking Sector
Fraud in the Banking Sector
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and Mitigation
 
Hadoop BIG Data - Fraud Detection with Real-Time Analytics
Hadoop BIG Data - Fraud Detection with Real-Time AnalyticsHadoop BIG Data - Fraud Detection with Real-Time Analytics
Hadoop BIG Data - Fraud Detection with Real-Time Analytics
 
Fraud detection
Fraud detectionFraud detection
Fraud detection
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
 
Big Fish Games: Democratizing Data Access
Big Fish Games: Democratizing Data AccessBig Fish Games: Democratizing Data Access
Big Fish Games: Democratizing Data Access
 
Medical University of South Carolina: Using Big Data and Predictive Analytics...
Medical University of South Carolina: Using Big Data and Predictive Analytics...Medical University of South Carolina: Using Big Data and Predictive Analytics...
Medical University of South Carolina: Using Big Data and Predictive Analytics...
 
Southwest Power Pool big data case study
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study
 
Constant Contact: An Online Marketing Leader’s Data Lake Journey
Constant Contact: An Online Marketing Leader’s Data Lake JourneyConstant Contact: An Online Marketing Leader’s Data Lake Journey
Constant Contact: An Online Marketing Leader’s Data Lake Journey
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
 
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
 

Similar to Real timefrauddetectiononbigdata

Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
Stavros Kontopoulos
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
Stavros Kontopoulos
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing
Luay AL-Assadi
 
Ieee transactions on 2018 network and service management
Ieee transactions on 2018 network and service managementIeee transactions on 2018 network and service management
Ieee transactions on 2018 network and service management
tsysglobalsolutions
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
Kamalika Dutta
 
Survey on An effective database tampering system with veriable computation a...
Survey on An effective  database tampering system with veriable computation a...Survey on An effective  database tampering system with veriable computation a...
Survey on An effective database tampering system with veriable computation a...
IRJET Journal
 
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
csandit
 
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENTREAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
cscpconf
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
Theo Schlossnagle
 
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
IJDKP
 
Big data for MNO
Big data for MNOBig data for MNO
Big data for MNO
Christian Ferenz
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
 
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
KaashivInfoTech Company
 
Big Data for Mobile Network Operator
Big Data for Mobile Network OperatorBig Data for Mobile Network Operator
Big Data for Mobile Network Operator
Erkan Örün
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Lot Explorer Report
Lot Explorer ReportLot Explorer Report
Lot Explorer Report
Christopher Roman
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
Varun Vijayaraghavan
 
Bdu -stream_processing_with_smack_final
Bdu  -stream_processing_with_smack_finalBdu  -stream_processing_with_smack_final
Bdu -stream_processing_with_smack_final
manishduttpurohit
 

Similar to Real timefrauddetectiononbigdata (20)

Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing
 
Ieee transactions on 2018 network and service management
Ieee transactions on 2018 network and service managementIeee transactions on 2018 network and service management
Ieee transactions on 2018 network and service management
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Survey on An effective database tampering system with veriable computation a...
Survey on An effective  database tampering system with veriable computation a...Survey on An effective  database tampering system with veriable computation a...
Survey on An effective database tampering system with veriable computation a...
 
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
 
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENTREAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
 
Big data for MNO
Big data for MNOBig data for MNO
Big data for MNO
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
 
Big Data for Mobile Network Operator
Big Data for Mobile Network OperatorBig Data for Mobile Network Operator
Big Data for Mobile Network Operator
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Lot Explorer Report
Lot Explorer ReportLot Explorer Report
Lot Explorer Report
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
 
Bdu -stream_processing_with_smack_final
Bdu  -stream_processing_with_smack_finalBdu  -stream_processing_with_smack_final
Bdu -stream_processing_with_smack_final
 

Recently uploaded

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 

Real time fraud detection on big data

  • 1. Real Time Fraud Detection With Sequence Mining on Big Data Platform
      - Pranab Ghosh, Big Data Consultant
      - IEEE CNSV meeting, May 6 2014, Santa Clara, CA
  • 2. Open Source Big Data Ecosystem
      - Query (NoSQL): Cassandra, HBase, MongoDB and more
      - Query (SQL): Hive, Stinger, Impala, Presto, Shark
      - Aggregation Analytic (NoSQL): Druid, MongoDB
      - Aggregation Analytic (SQL): Hive, Stinger, Impala, Presto, Shark
      - Advanced Analytic: Hadoop, Spark
      - Real time: Storm, Samza, S4, Spark Streaming
  • 3. Hadoop
      - The power of functional programming and parallel processing join hands in Hadoop
      - Parallel processing framework running on a cluster of commodity machines
      - The functional programming is stateless, because the processing of each row of data does not depend on any other row or any state
      - Divide and conquer: data gets partitioned and the partitions get processed in parallel
  • 4. Storm
      - Clustered framework for scalable real time stream processing
      - Like Hadoop, a parallel processing framework that runs on a cluster of commodity machines
      - Uses a combination of processes and threads for parallelism, unlike Hadoop, which uses only processes
      - Unlike the two processing stages in Hadoop (map and reduce), a Storm topology can define multiple processing stages
  • 5. Redis
      - It’s a wonderful glue for the Big Data ecosystem
      - Many people think of it as a distributed data structure server
      - Can be used as a list, queue, cache etc.
      - Supports master slave replication
      - There is no sharding support
  • 6. Fraud Detection Basics
      - Belongs to the general category of problems known as Outlier Detection, i.e., detecting data points that don’t follow the trends and patterns in the data
      - Two approaches for treating input: 1. focus on an individual data point 2. focus on a sequence of data points
      - Two kinds of algorithms: 1. building a model out of the data 2. using the data directly
      - Real time fraud detection is only feasible with the model based approach. A model is built by batch processing the training data. A real time stream processor uses the model and makes predictions in real time
  • 7. Credit Card Fraud Detection
      - We will use the sequence based approach. We will use Hadoop to build a Markov Chain model
      - Storm will process transaction data in real time and will use the Markov Chain model to predict potential fraud in the incoming transaction stream
      - Redis is used for the incoming transaction queue. It’s also used as a key value store for the Markov model. Fraudulent transaction sequences are written to another Redis queue
  • 8. Why Sequence Based Algorithm
      - Sequence based algorithms are generally more powerful. Certain fraudulent activities may not be detectable with instance based algorithms
      - If your bank account is hacked and there are many transactions, each withdrawing a small amount of money, instance based algorithms will fail to detect the fraud
      - However, a sequence based algorithm is more likely to detect the fraudulent activity
  • 10. Building Markov Chain Model with Hadoop
      - A transaction consists of the triplet (amount spent, whether it includes a high price ticket item, time elapsed since the last transaction)
      - Each item in the triplet is categorical and has 2 or 3 possible values. We end up with 18 possible transaction types
      - Our goal is to build an 18 x 18 state transition probability matrix
      - Input data consists of customer ID, transaction ID and transaction type
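The 18 transaction types on this slide follow from 3 x 2 x 3 combinations of the triplet. A minimal sketch of that enumeration is below. The single-letter category labels are assumptions made for illustration; the slide only states that each field takes 2 or 3 values.

```python
from itertools import product

# Hypothetical category labels (not given on the slide).
AMOUNT = ["L", "N", "H"]     # amount spent: low, normal, high
HIGH_TICKET = ["N", "H"]     # includes a high price ticket item: no, yes
ELAPSED = ["L", "N", "S"]    # time since last transaction: large, normal, small

# Every combination of the three fields is one Markov state.
STATES = ["".join(t) for t in product(AMOUNT, HIGH_TICKET, ELAPSED)]

# Row/column position of each state in the 18 x 18 transition matrix.
STATE_INDEX = {state: i for i, state in enumerate(STATES)}


def encode(amount, high_ticket, elapsed):
    """Map one transaction's three categorical fields to a state label."""
    state = amount + high_ticket + elapsed
    if state not in STATE_INDEX:
        raise ValueError(f"unknown transaction type: {state}")
    return state
```

With this encoding, each input record carries a customer ID, a transaction ID and one of the 18 state labels.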
  • 11. Building Markov Chain Model with Hadoop
      - The data is processed through a Map Reduce job that groups by customer ID, so that we have the transaction sequence for each customer
      - The next Map Reduce job counts the different state transitions and builds the 18 x 18 state transition probability matrix
      - We are using a first order Markov model, i.e., the probability of a state depends only on the immediately preceding state
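The two passes described on this slide can be sketched in plain Python. This is not the Hadoop implementation: it runs in memory, uses a toy two-state alphabet instead of the 18 states, and assumes the records arrive in time order for each customer. The function names are hypothetical.

```python
from collections import defaultdict


def group_by_customer(records):
    """First pass: collect each customer's transaction types in input order."""
    sequences = defaultdict(list)
    for customer_id, _txn_id, txn_type in records:
        sequences[customer_id].append(txn_type)
    return sequences


def transition_matrix(sequences, states):
    """Second pass: count state transitions, then normalize each row."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for seq in sequences.values():
        # First order model: only consecutive pairs matter.
        for prev, curr in zip(seq, seq[1:]):
            counts[index[prev]][index[curr]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state never seen as a source keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy input: (customer ID, transaction ID, transaction type).
records = [
    ("c1", "t1", "A"), ("c1", "t2", "B"), ("c1", "t3", "A"),
    ("c2", "t4", "A"), ("c2", "t5", "A"),
]
P = transition_matrix(group_by_customer(records), ["A", "B"])
```

Each non-empty row of the result sums to 1, so row i can be read as the probability distribution over the next state given the current state i.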
  • 12. Real Time Prediction with Storm
      - Our Storm topology (i.e. job) is very simple, with one spout for ingesting the transaction stream from a Redis queue. There is one bolt that does all the processing
      - The incoming transaction stream is field grouped (i.e., partitioned) by customer ID, so that each bolt instance processes data for a subset of customers
      - Each bolt maintains a window of preconfigured size for each user, where it collects the n most recent transactions
      - Every time the window gets updated, the bolt calculates a metric indicative of whether the current transaction sequence is fraudulent
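The partitioning and windowing on this slide can be illustrated without Storm. The sketch below is plain Python, not Storm API code. `task_for` uses a CRC32 hash as a stand-in for whatever hashing Storm applies internally, and `CustomerWindows` is a hypothetical name for the per-customer buffer a bolt instance would hold.

```python
import zlib
from collections import defaultdict, deque


def task_for(customer_id, num_tasks):
    """Pick a bolt task for a customer; the same ID always maps to the same task."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_tasks


class CustomerWindows:
    """Keeps the n most recent transaction types for each customer."""

    def __init__(self, size):
        self.size = size
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.windows = defaultdict(lambda: deque(maxlen=size))

    def add(self, customer_id, txn_type):
        """Append a transaction and return the customer's current window."""
        window = self.windows[customer_id]
        window.append(txn_type)
        return list(window)

    def is_full(self, customer_id):
        return len(self.windows[customer_id]) == self.size


windows = CustomerWindows(size=3)
for txn_type in ["A", "B", "A", "B"]:
    current = windows.add("c1", txn_type)
```

Because all of a customer's transactions reach the same task, that task's window always holds the complete recent history for the customer.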
  • 13. Real Time Prediction with Storm
      - There are various outlier metrics for sequences. The one we are using is called the Miss Probability Metric
      - For each pair of consecutive transaction states in the sequence, it sums the state transition probabilities from the first state to every target state except the second state of the pair. The process is repeated for all pairs in the sequence and all the probabilities are summed
      - If the final sum exceeds some predefined threshold, the transaction sequence is deemed fraudulent. The transaction sequence is written to a Redis queue
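One reading of the Miss Probability Metric described on this slide is sketched below: for each consecutive pair of states, add up the transition probabilities from the first state to every state other than the one actually observed. The function names, the toy two-state matrix and the threshold value are assumptions for illustration, not taken from the author's implementation.

```python
def miss_probability(sequence, matrix, index):
    """Sum, over consecutive state pairs, the probability of NOT making
    the transition that was actually observed."""
    total = 0.0
    for prev, curr in zip(sequence, sequence[1:]):
        row = matrix[index[prev]]
        target = index[curr]
        total += sum(p for k, p in enumerate(row) if k != target)
    return total


def is_fraudulent(sequence, matrix, index, threshold):
    """Flag the window when the metric exceeds the configured threshold."""
    return miss_probability(sequence, matrix, index) > threshold


# Toy model: from A, staying in A is common; moving to B is rare.
index = {"A": 0, "B": 1}
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]

usual = miss_probability(["A", "A", "A"], P, index)    # two likely transitions
unusual = miss_probability(["A", "B", "A"], P, index)  # starts with a rare one
```

When every row sums to 1, the contribution of a pair is simply 1 minus the probability of the observed transition, so rare transitions push the metric up quickly.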
  • 14. Sequence Outlier Metric
      - The metric we have used reflects the probability of the transaction sequence in the window. The higher the metric, the lower the probability of the transaction sequence
      - The lower the probability of the transaction sequence, the higher the likelihood of the sequence being an outlier, i.e., fraudulent
      - The size of the window holding the current transactions is an important factor. The smaller the window, the more locally sensitive the result
      - Proper choice of the metric threshold is critical. A smaller value will cause more false positives and a larger value will cause more false negatives. Since false negatives are costlier, it’s more conservative to choose a smaller threshold value
  • 15. Some Storm Limitations
      - Storm guarantees at least once message processing semantics. Some messages may be processed twice. The impact of this is not too serious in this case
      - Storm does not provide any state management features. We were doing stateful processing inside the Storm bolt, with recent transactions stored in an in memory buffer
      - The impact of this could be serious in our case. A Storm node might go down right when the buffer contains potentially fraudulent transactions
  • 16. Summing Up
      - We have shown how different components of the Big Data ecosystem can be orchestrated to solve the real time fraud detection problem
      - We have used Hadoop and Storm effectively as a solution platform. However, if I were to start the project today, I would seriously consider Spark and Spark Streaming
  • 17. Resources
      - For more details, please visit my blog at http://pkghosh.wordpress.com/2013/10/21/real-time-fraud-detection-with-sequence-mining/
      - The implementation is part of my OSS project on github at https://github.com/pranab/beymani
      - A survey of different outlier detection algorithms is available in this blog post of mine: http://pkghosh.wordpress.com/2012/01/02/fraudsters-outliers-and-big-data-2/