SlideShare a Scribd company logo
1 of 17
Real Time Fraud Detection With
Sequence Mining on Big Data
Platform
Pranab Ghosh
Big Data Consultant
IEEE CNSV meeting, May 6 2014
Santa Clara, CA
Open Source Big Data Eco
System
 Query (NOSQL) : Cassandra, HBase, MongoDB and
more
 Query (SQL) : Hive, Stinger, Impala, Presto, Shark
 Aggregation Analytic (NOSQL) : Druid, MongoDB
 Aggregation Analytic (SQL) : Hive, Stinger, Impala,
Presto, Shark
 Advanced Analytic : Hadoop, Spark
 Real time : Storm, Samza, S4, Spark Streaming
Hadoop
 Power of functional programming and parallel
processing join hands in Hadoop
 Parallel processing framework running on cluster of
commodity machines
 Stateless functional programming because processing
of each row of data does not depend upon any other
row or any state
 Divide and conquer. Data gets partitioned and
partitions get processed in parallel
Storm
 Clustered framework for scalable real time stream
processing
 Like Hadoop, parallel processing framework runs on
cluster of commodity machines
 Uses a combination of processes and threads for
parallelism, unlike Hadoop which uses only processes
 Unlike 2 processing stages in Hadoop (map and
reduce) there can be multiple processing stages
defined in a Storm topology.
Redis
 It’s a wonderful glue for Big Data eco system
 Many people think of it as a distributed data structure
server
 Can be used for list, queue, cache etc.
 Supports master slave replication
 There is no sharding support
Fraud Detection Basics
 Belongs to the general category of problems known as
Outlier Detection i.e., detecting data points that don’t follow
the trends and patters in the data
 Two approaches for treating input: 1. focus on instance of
data point 2. focus on sequence of data points
 Two kinds of algorithms : 1. building a model out of data 2.
using data directly.
 Real time fraud detection is only feasible with model based
approach. A model is built with batch processing of training
data. A real time stream processor uses the model and
makes predictions in real time
Credit Card Fraud Detection
 We will use sequence based approach. We will use
Hadoop to build a Markov Chain model
 Storm will process transaction data in real time and will
use the Markov Chain model to predict potential fraud
in the incoming transaction stream
 Redis is used for the incoming transaction queue. It’s
also used as key value store for the Markov model.
Fraudulent transaction sequences are written to
another Redis queue
Why Sequence Based
Algorithm
 The sequence based algorithms are generally
powerful. Certain fraudulent activities may not be
detectable with instance based algorithms
 If your bank account is hacked and there are many
transactions involving withdrawal of small amount of
money, instance based algorithms will fail to detect the
fraud
 However a sequence based algorithm is more likely to
detect the fraudulent activities.
Architecture
Building Markov Chain Model
with Hadoop
 A transaction consists of the triplet (amount spent,
whether it includes high price ticket item, time elapsed
since the last transaction)
 Each item in the triplet is categorical and has possible
2 or 3 values. We end up with 18 possible transaction
types.
 Our goal is to build a 18 x 18 state transition probability
matrix
 Input data consists of customer ID, transaction ID and
transaction type.
Building Markov Chain Model
with Hadoop
 The data is processed through a Map Reduce job to
group by customer ID, so that we can have the
transaction sequence for each customer.
 The next Map Reduce job counts the the different state
transitions and builds the the 18 x 18 state transition
probability matrix
 We are using the first order Markov model i.e., the
probability of a state only depends on the earlier state.
Real Time Prediction with
Storm
 Our Storm topology (i.e. job) is very simple, with one spout
for ingesting transaction stream from a Redis queue. There
is one bolt that does all the processing.
 The incoming transaction stream is field grouped (i.e.,
partitioned) by the customer ID, so that each bolt instance
processes data for a subset of customers.
 Each bolt maintains a window of preconfigured size for each
user, where it collects the recent n transactions
 Every time the window gets updated, the bolt calculates
certain metric indicative of whether the current transaction
sequence is fraudulent
Real Time Prediction with
Storm
 There are are various outlier metrics for sequence. The one
we are using is called Miss Probability Metric
 For a transaction state pair in the sequence, it sums all the
state transition probabilities except when the target state is
the second state of the pair. The process is repeated for all
pair in the sequence and all probabilities are summed.
 If the final sum exceeds some predefined threshold, the
transaction sequence is deemed fraudulent. The transaction
sequence is written to a Redis queue
Sequence Outlier Metric
 The metric we have used reflects the probability of the transaction
sequence in the window. Higher the metric, lower the probability of
the transaction sequence.
 Lower the probability of the transaction sequence, higher the
likelihood of the transaction sequence being an outlier i.e.,
fraudulent.
 The size of the window holding the current transactions is an
important factor. Smaller the time window more locally sensitive
the result
 Proper choice of the metric threshold is critical. A smaller value will
cause more false positives and a larger value will cause more
false negatives. Since false negatives are costlier, it’s more
conservative to choose a smaller threshold value
Some Storm Limitations
 Storm guarantees at least once message processing
semantics. Some messages may be processed twice.
Impact of this is not too serious in this case.
 Storm does not provide any state management
features. We were doing state full processing inside the
storm bolt with recent transactions stored in an in
memory buffer.
 Impact of this could be serious in our case. A Storm
node might go down right around when the buffer
contains potentially fraudulent transactions.
Summing Up
 We have shown how different components of the Big
Data ecosystem can be orchestrated to solve the real
time fraud detection problem
 We have used Hadoop and Storm effectively as a
solution platform. However, if I were to start the project
today, I would seriously consider Spark and Spark
Streaming.
Resources
 For more details, please visit my blog at
http://pkghosh.wordpress.com/2013/10/21/real-time-
fraud-detection-with-sequence-mining/
 The implementation is part of my OSS project on github
at https://github.com/pranab/beymani
 A survey of different outlier detection algorithms is
available in this blog post of mine
http://pkghosh.wordpress.com/2012/01/02/fraudsters-
outliers-and-big-data-2/

More Related Content

Viewers also liked

Fraud in the Banking Sector
Fraud in the Banking Sector Fraud in the Banking Sector
Fraud in the Banking Sector
Venktesh Venke
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
kalpesh1908
 

Viewers also liked (20)

Medicare fraud detection
Medicare fraud detection Medicare fraud detection
Medicare fraud detection
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
 
Healthcare fraud detection
Healthcare fraud detectionHealthcare fraud detection
Healthcare fraud detection
 
Fraud Detection Using A Database Platform
Fraud Detection Using A Database PlatformFraud Detection Using A Database Platform
Fraud Detection Using A Database Platform
 
How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?How to apply graph analytics for bank loan fraud detection?
How to apply graph analytics for bank loan fraud detection?
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using Hadoop
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detection
 
Big Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud DetectionBig Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud Detection
 
Fraud in the Banking Sector
Fraud in the Banking Sector Fraud in the Banking Sector
Fraud in the Banking Sector
 
ACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and MitigationACFE Presentation on Analytics for Fraud Detection and Mitigation
ACFE Presentation on Analytics for Fraud Detection and Mitigation
 
Hadoop BIG Data - Fraud Detection with Real-Time Analytics
Hadoop BIG Data - Fraud Detection with Real-Time AnalyticsHadoop BIG Data - Fraud Detection with Real-Time Analytics
Hadoop BIG Data - Fraud Detection with Real-Time Analytics
 
Fraud detection
Fraud detectionFraud detection
Fraud detection
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
CSNI: How State Medicaid Agencies Can Use Analytics to Predict Opioid Abuse a...
 
Big Fish Games: Democratizing Data Access
Big Fish Games: Democratizing Data AccessBig Fish Games: Democratizing Data Access
Big Fish Games: Democratizing Data Access
 
Medical University of South Carolina: Using Big Data and Predictive Analytics...
Medical University of South Carolina: Using Big Data and Predictive Analytics...Medical University of South Carolina: Using Big Data and Predictive Analytics...
Medical University of South Carolina: Using Big Data and Predictive Analytics...
 
Southwest Power Pool big data case study
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study
 
Constant Contact: An Online Marketing Leader’s Data Lake Journey
Constant Contact: An Online Marketing Leader’s Data Lake JourneyConstant Contact: An Online Marketing Leader’s Data Lake Journey
Constant Contact: An Online Marketing Leader’s Data Lake Journey
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
 
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
 

Similar to Real timefrauddetectiononbigdata

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 

Similar to Real timefrauddetectiononbigdata (20)

Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing
 
Ieee transactions on 2018 network and service management
Ieee transactions on 2018 network and service managementIeee transactions on 2018 network and service management
Ieee transactions on 2018 network and service management
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Survey on An effective database tampering system with veriable computation a...
Survey on An effective  database tampering system with veriable computation a...Survey on An effective  database tampering system with veriable computation a...
Survey on An effective database tampering system with veriable computation a...
 
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENTREAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
 
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
Real-Time Pedestrian Detection Using Apache Storm in a Distributed Environment
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTRO...
 
Big data for MNO
Big data for MNOBig data for MNO
Big data for MNO
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
Magiclock: Scalable Detection of Potential Deadlocks in Large-Scale Multithre...
 
Big Data for Mobile Network Operator
Big Data for Mobile Network OperatorBig Data for Mobile Network Operator
Big Data for Mobile Network Operator
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Lot Explorer Report
Lot Explorer ReportLot Explorer Report
Lot Explorer Report
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Real time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.lyReal time stream processing presentation at General Assemb.ly
Real time stream processing presentation at General Assemb.ly
 
Bdu -stream_processing_with_smack_final
Bdu  -stream_processing_with_smack_finalBdu  -stream_processing_with_smack_final
Bdu -stream_processing_with_smack_final
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 

Real timefrauddetectiononbigdata

  • 1. Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
  • 2. Open Source Big Data Eco System  Query (NOSQL) : Cassandra, HBase, MongoDB and more  Query (SQL) : Hive, Stinger, Impala, Presto, Shark  Aggregation Analytic (NOSQL) : Druid, MongoDB  Aggregation Analytic (SQL) : Hive, Stinger, Impala, Presto, Shark  Advanced Analytic : Hadoop, Spark  Real time : Storm, Samza, S4, Spark Streaming
  • 3. Hadoop  Power of functional programming and parallel processing join hands in Hadoop  Parallel processing framework running on cluster of commodity machines  Stateless functional programming because processing of each row of data does not depend upon any other row or any state  Divide and conquer. Data gets partitioned and partitions get processed in parallel
  • 4. Storm  Clustered framework for scalable real time stream processing  Like Hadoop, parallel processing framework runs on cluster of commodity machines  Uses a combination of processes and threads for parallelism, unlike Hadoop which uses only processes  Unlike 2 processing stages in Hadoop (map and reduce) there can be multiple processing stages defined in a Storm topology.
  • 5. Redis  It’s a wonderful glue for Big Data eco system  Many people think of it as a distributed data structure server  Can be used for list, queue, cache etc.  Supports master slave replication  There is no sharding support
  • 6. Fraud Detection Basics  Belongs to the general category of problems known as Outlier Detection i.e., detecting data points that don’t follow the trends and patters in the data  Two approaches for treating input: 1. focus on instance of data point 2. focus on sequence of data points  Two kinds of algorithms : 1. building a model out of data 2. using data directly.  Real time fraud detection is only feasible with model based approach. A model is built with batch processing of training data. A real time stream processor uses the model and makes predictions in real time
  • 7. Credit Card Fraud Detection  We will use sequence based approach. We will use Hadoop to build a Markov Chain model  Storm will process transaction data in real time and will use the Markov Chain model to predict potential fraud in the incoming transaction stream  Redis is used for the incoming transaction queue. It’s also used as key value store for the Markov model. Fraudulent transaction sequences are written to another Redis queue
  • 8. Why Sequence Based Algorithm  The sequence based algorithms are generally powerful. Certain fraudulent activities may not be detectable with instance based algorithms  If your bank account is hacked and there are many transactions involving withdrawal of small amount of money, instance based algorithms will fail to detect the fraud  However a sequence based algorithm is more likely to detect the fraudulent activities.
  • 10. Building Markov Chain Model with Hadoop  A transaction consists of the triplet (amount spent, whether it includes high price ticket item, time elapsed since the last transaction)  Each item in the triplet is categorical and has possible 2 or 3 values. We end up with 18 possible transaction types.  Our goal is to build a 18 x 18 state transition probability matrix  Input data consists of customer ID, transaction ID and transaction type.
  • 11. Building Markov Chain Model with Hadoop  The data is processed through a Map Reduce job to group by customer ID, so that we can have the transaction sequence for each customer.  The next Map Reduce job counts the the different state transitions and builds the the 18 x 18 state transition probability matrix  We are using the first order Markov model i.e., the probability of a state only depends on the earlier state.
  • 12. Real Time Prediction with Storm  Our Storm topology (i.e. job) is very simple, with one spout for ingesting transaction stream from a Redis queue. There is one bolt that does all the processing.  The incoming transaction stream is field grouped (i.e., partitioned) by the customer ID, so that each bolt instance processes data for a subset of customers.  Each bolt maintains a window of preconfigured size for each user, where it collects the recent n transactions  Every time the window gets updated, the bolt calculates certain metric indicative of whether the current transaction sequence is fraudulent
  • 13. Real Time Prediction with Storm  There are are various outlier metrics for sequence. The one we are using is called Miss Probability Metric  For a transaction state pair in the sequence, it sums all the state transition probabilities except when the target state is the second state of the pair. The process is repeated for all pair in the sequence and all probabilities are summed.  If the final sum exceeds some predefined threshold, the transaction sequence is deemed fraudulent. The transaction sequence is written to a Redis queue
  • 14. Sequence Outlier Metric  The metric we have used reflects the probability of the transaction sequence in the window. Higher the metric, lower the probability of the transaction sequence.  Lower the probability of the transaction sequence, higher the likelihood of the transaction sequence being an outlier i.e., fraudulent.  The size of the window holding the current transactions is an important factor. Smaller the time window more locally sensitive the result  Proper choice of the metric threshold is critical. A smaller value will cause more false positives and a larger value will cause more false negatives. Since false negatives are costlier, it’s more conservative to choose a smaller threshold value
  • 15. Some Storm Limitations  Storm guarantees at least once message processing semantics. Some messages may be processed twice. Impact of this is not too serious in this case.  Storm does not provide any state management features. We were doing state full processing inside the storm bolt with recent transactions stored in an in memory buffer.  Impact of this could be serious in our case. A Storm node might go down right around when the buffer contains potentially fraudulent transactions.
  • 16. Summing Up  We have shown how different components of the Big Data ecosystem can be orchestrated to solve the real time fraud detection problem  We have used Hadoop and Storm effectively as a solution platform. However, if I were to start the project today, I would seriously consider Spark and Spark Streaming.
  • 17. Resources  For more details, please visit my blog at http://pkghosh.wordpress.com/2013/10/21/real-time- fraud-detection-with-sequence-mining/  The implementation is part of my OSS project on github at https://github.com/pranab/beymani  A survey of different outlier detection algorithms is available in this blog post of mine http://pkghosh.wordpress.com/2012/01/02/fraudsters- outliers-and-big-data-2/