Outlier and Fraud Detection

 Big Data Science Meetup
        July, 2012
        Fremont, CA



About Me
                   Pranab Ghosh

•25+ years in the industry
• Worked with various technologies and platforms
• Worked for startups, Fortune 500 companies and everything in between
• Big data consultant for the last few years
• Currently a consultant at Apple
• Active Blogger
• Blog site: http://pkghosh.wordpress.com/
• Owner of several open source projects
• Project site: https://github.com/pranab
• Passionate about data and finding patterns in data


My Open Source Hadoop Projects


• Recommendation engine (sifarish) based on content-based
and social recommendation algorithms

• Fraud analytics (beymani) using proximity and
distribution model based algorithms. Today’s talk is
related to this project.

• Web click stream analytics (visitante) for descriptive
and predictive analytics




Outlier Detection


• Data that do not conform to normal, expected
patterns are outliers
• Wide range of applications in various domains,
including finance, security and intrusion detection in
cyber security
• Criteria for what constitutes an outlier depend on the
problem domain
• Typically involves large amounts of data, which may be
unstructured, creating an opportunity to use big data
technologies



Data Type

• Instance data, where the outlier detection algorithm
operates on individual instances of data, e.g., a particular
credit card transaction involving a large amount of money
spent purchasing an unusual product

• Sequence data with temporal or spatial relationships.
The goal of outlier detection is to find unusual
sequences, e.g., intrusion detection in cyber security

• Our focus is on outlier detection for instance data
using Hadoop. We will be using credit card transaction
data as an example

Challenges

• Defining the normal regions in a data set is the main
challenge. The boundary between normal and outlier
may not be crisply defined.

• The definition of normal behavior may evolve with time.
What is normal today may be considered anomalous in
the future and vice versa.

• In many cases malicious adversaries adapt
themselves to make their operations look normal and
try to stay undetected



Instance Based Analysis
• Supervised classification techniques using labeled
training data with normal and outlier data, e.g., Bayesian
filtering, Neural Network, Support Vector Machine etc.
Not very reliable because of the lack of labeled outlier data
• Multivariate probability distribution based. Data points
with low probability are likely to be outliers
• Proximity based approaches. Distances between data
points are calculated in a multi-dimensional feature
space
• Relative density based. Density is the inverse of the average
distance to neighbors.
Instance Based Analysis (contd)
• Shared nearest neighbor based. We consider the
number of shared neighbors between neighboring data
points.
• Clustering based. Data points with poor cluster
membership are likely outliers.
• Information theory based. Inclusion of an outlier causes
an increase in the entropy of the data set. We identify data
points whose removal causes a large drop in the entropy of
the data set
• … and many more techniques.
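
The information theory bullet can be made concrete with a small sketch. This is only an illustration under assumptions not in the talk: the class and method names are hypothetical, the data is a single categorical attribute, and entropy is Shannon entropy in bits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of beymani.
class EntropyOutlierSketch {

    // Shannon entropy (in bits) of the value distribution.
    static double entropy(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double h = 0;
        for (int c : counts.values()) {
            double p = c / (double) values.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Entropy drop caused by removing the value at index i.
    // A large positive drop suggests that the value is an outlier.
    static double entropyDrop(List<String> values, int i) {
        List<String> rest = new ArrayList<>(values);
        rest.remove(i);  // removes by index
        return entropy(values) - entropy(rest);
    }
}
```

With four identical values and one odd one, removing the odd value takes the entropy to zero, while removing one of the common values makes the entropy go up, so only the odd value shows a positive drop.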

Sequence Based Analysis

• Keeping a list of known sequences corresponding to
malicious behavior and detecting those in the data. Does
not work well for new and unknown threats

• Markov chain, which considers observable states and
the probability of transition between states.

• Hidden Markov Model, where the system has both hidden
and observable states
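
As a rough illustration of the Markov chain approach, the sketch below estimates transition probabilities from sequences of normal behavior and flags a sequence by its least likely transition. The class name, the state names and the choice of the minimum transition probability as the score are all assumptions made for this example, not something taken from the talk.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a first-order Markov chain over observable states.
class MarkovChainSketch {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Count state-to-state transitions in one sequence of normal behavior.
    void train(List<String> sequence) {
        for (int i = 0; i + 1 < sequence.size(); i++) {
            counts.computeIfAbsent(sequence.get(i), s -> new HashMap<>())
                  .merge(sequence.get(i + 1), 1, Integer::sum);
            totals.merge(sequence.get(i), 1, Integer::sum);
        }
    }

    // Estimated probability of moving from one state to another (0 if never seen).
    double transitionProbability(String from, String to) {
        int total = totals.getOrDefault(from, 0);
        if (total == 0) {
            return 0.0;
        }
        return counts.get(from).getOrDefault(to, 0) / (double) total;
    }

    // Smallest transition probability along the sequence.
    // A low value flags the sequence as unusual.
    double minTransitionProbability(List<String> sequence) {
        double min = 1.0;
        for (int i = 0; i + 1 < sequence.size(); i++) {
            min = Math.min(min, transitionProbability(sequence.get(i), sequence.get(i + 1)));
        }
        return min;
    }
}
```

A sequence containing a transition never seen in training scores 0, which is the sense in which this approach can react to behavior that no list of known bad sequences covers.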




Model Based vs Memory Based

• As you may have observed, with some of the methods
we build a model from the training data and apply the
model to detect outliers

• With the other methods, we don’t build a model but use
the existing data directly to detect outliers

• The technique we will discuss today is based on the latter
approach, i.e., memory based.




Average Distance to k Neighbors

• We find the distance between each pair of points.
This has a computational complexity of O(n x n)
• For each point we find the k nearest neighbors, where k is
a user-configured number
• For each point, we find the average distance to the k
nearest neighbors
• Identify data points with a high average distance to their
neighbors. Outliers will have a high average distance to
their neighbors
• We can select data points above some threshold
average distance or choose the top n based on
average distance
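
The steps above can be sketched in memory for a handful of points. This is a minimal illustration, assuming one-dimensional points and absolute difference as the distance. The class name and sample amounts are made up for the example, and the talk itself performs this at scale with MapReduce.

```java
import java.util.Arrays;

// Hypothetical in-memory sketch of the average k-nearest-neighbor distance.
class KnnOutlierSketch {

    // Average distance from point i to its k nearest neighbors.
    // Assumes 1 <= k <= points.length - 1.
    static double averageKnnDistance(double[] points, int i, int k) {
        double[] dist = new double[points.length - 1];
        int n = 0;
        for (int j = 0; j < points.length; j++) {
            if (j != i) {
                dist[n++] = Math.abs(points[i] - points[j]);
            }
        }
        Arrays.sort(dist);  // ascending, so the first k entries are the nearest neighbors
        double sum = 0;
        for (int j = 0; j < k; j++) {
            sum += dist[j];
        }
        return sum / k;
    }

    public static void main(String[] args) {
        double[] amounts = {20.0, 22.0, 25.0, 21.0, 500.0};
        for (int i = 0; i < amounts.length; i++) {
            System.out.println(amounts[i] + " -> " + averageKnnDistance(amounts, i, 2));
        }
    }
}
```

With k = 2, the point 500.0 gets an average distance of 476.5 while the point 20.0 gets 1.5, so either a threshold or a top-n cut would single out 500.0.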

Big Data Ecosystem




Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
When to Use Hadoop




Credit: http://www.aaroncordova.com/2012/01/do-i-need-sql-or-hadoop-flowchart.html
Map Reduce Data Flow




Credit : Yahoo Developer Network
Hadoop at 30000 ft
• MapReduce – Parallel processing pattern. Functional
programming model. Implemented as a framework, with
user supplied map and reduce code.

• HDFS – Replicated and partitioned file system. Optimized
for sequential access. Writes are append only.

• Data Locality – Code moves to where the data resides and
gets executed there.

• IO Bound – Jobs are typically IO bound (disk and network)


Credit Card Transaction
 We have a very simple data model. Each credit card
transaction contains the following 4 attributes

1.   Transaction ID
2.   Time of the day
3.   Money spent
4.   Vendor type


Here are some examples. The last one is an outlier,
injected into the data set.

YX66AJ9U           1025   20.47             drug store
98ZCM6B1           1910   55.50             restaurant
XXXX7362           0100   1875.40           jewellery store



Distance Calculation
• For a numerical attribute (e.g., money amount), the distance is
the difference in values

• For an unranked categorical attribute (e.g., vendor type), the
distance is 0 if the values are the same and 1 otherwise. The
distance could also be set softly between 0 and 1 (e.g., product
color).

• If the unranked categorical attributes have a hierarchical
relationship, the minimum number of edges to traverse from
one node to the other could be used as the distance (e.g.,
vendor type hierarchy)
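
The three per-attribute distances above could look roughly like the following. This is a sketch under assumptions: the class and method names are hypothetical, the numerical distance is taken as the absolute difference, and the hierarchy is represented as a child-to-parent map in which the root has no entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-attribute distances.
class AttributeDistanceSketch {

    // Numerical attribute: absolute difference of the two values.
    static double numeric(double a, double b) {
        return Math.abs(a - b);
    }

    // Unranked categorical attribute: 0 if equal, 1 otherwise.
    static double categorical(String a, String b) {
        return a.equals(b) ? 0.0 : 1.0;
    }

    // Hierarchical categorical attribute: number of edges on the path between two nodes.
    static int hierarchical(String a, String b, Map<String, String> parent) {
        List<String> pathA = pathToRoot(a, parent);
        List<String> pathB = pathToRoot(b, parent);
        for (int i = 0; i < pathA.size(); i++) {
            int j = pathB.indexOf(pathA.get(i));
            if (j >= 0) {
                // edges up from a plus edges up from b to the lowest common ancestor
                return i + j;
            }
        }
        throw new IllegalArgumentException("nodes are not in the same hierarchy");
    }

    // Nodes from the given node up to the root, inclusive.
    private static List<String> pathToRoot(String node, Map<String, String> parent) {
        List<String> path = new ArrayList<>();
        for (String n = node; n != null; n = parent.get(n)) {
            path.add(n);
        }
        return path;
    }
}
```

In a made-up hierarchy where "drug store" and "jewellery store" both sit under "retail", their distance is 2 edges, while "drug store" and a "restaurant" under a different branch are 4 edges apart.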


Distance Aggregation

• We aggregate across all attributes to find the net
distance between 2 entities

• Different ways to aggregate: Euclidean, Manhattan.
Attributes can be weighted during aggregation to reflect
their relative importance
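
A minimal sketch of weighted aggregation follows, assuming the per-attribute distances are already computed and held in an array alongside a weight per attribute. The class name is hypothetical, and applying the weight to the squared term in the Euclidean case is one common convention, not necessarily the one used in beymani.

```java
// Hypothetical sketch of aggregating per-attribute distances.
class DistanceAggregationSketch {

    // Weighted Manhattan aggregation: sum of w[i] * d[i].
    static double manhattan(double[] d, double[] w) {
        double sum = 0;
        for (int i = 0; i < d.length; i++) {
            sum += w[i] * d[i];
        }
        return sum;
    }

    // Weighted Euclidean aggregation: square root of the sum of w[i] * d[i]^2.
    static double euclidean(double[] d, double[] w) {
        double sum = 0;
        for (int i = 0; i < d.length; i++) {
            sum += w[i] * d[i] * d[i];
        }
        return Math.sqrt(sum);
    }
}
```

Setting a weight to 0 drops that attribute from the net distance, and raising it makes differences in that attribute dominate.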




Pairwise Distance Calculation MR

• It’s an O(n x n) problem. If there are 1 million transactions,
we need to perform 1 trillion computations.
• The work will be divided up between the reducers. If we
have a 100 node Hadoop cluster with 10 reducer slots
per node, each reducer will perform roughly 1 billion
distance calculations.
• How do we divide up the work? Use partitioned hashing.
If h1 = hash(id1) and h2 = hash(id2), we use a function of h1
and h2 as the key of the mapper output. For example
f(h1,h2) = h1 << 10 | h2.
• All the transactions with IDs hashed to h1 or h2 will end up
with the same reducer.
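
The key function f(h1,h2) can be sketched as below. Two details are assumptions added for the example: the larger bucket index is placed first so that (h1,h2) and (h2,h1) give the same key, and the bucket count is at most 1024 so that the second index fits in the low 10 bits. The class and method names are hypothetical.

```java
// Hypothetical sketch of the mapper output key for a pair of hash buckets.
class PairKeySketch {

    // Bucket index in [0, bucketCount); floorMod keeps it non-negative
    // even when hashCode() is negative.
    static int bucket(String id, int bucketCount) {
        return Math.floorMod(id.hashCode(), bucketCount);
    }

    // f(h1, h2) = h1 << 10 | h2, with the larger index first so the pair is unordered.
    // Assumes both indexes are below 1024.
    static int pairKey(int h1, int h2) {
        int hi = Math.max(h1, h2);
        int lo = Math.min(h1, h2);
        return hi << 10 | lo;
    }
}
```

With B buckets there are B(B+1)/2 distinct keys, so the number of buckets controls how finely the pairwise work is spread over the reducers.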

Partitioned Hashing

• Code snippet from SameTypeSimilarity.java
String partition = partitonOrdinal >= 0 ? items[partitonOrdinal] : "none";

// bucket index of this record, derived from the hash of its ID
hash = (items[idOrdinal].hashCode() % bucketCount + bucketCount) / 2;
// emit the record once for every bucket pair that contains its own bucket
for (int i = 0; i < bucketCount; ++i) {
    if (i < hash) {
        hashPair = hash * 1000 + i;
        keyHolder.set(partition, hashPair, 0);
        valueHolder.set("0" + value.toString());
    } else {
        hashPair = i * 1000 + hash;
        keyHolder.set(partition, hashPair, 1);
        valueHolder.set("1" + value.toString());
    }
    context.write(keyHolder, valueHolder);
}
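
To see why this loop brings every pair of records together, the sketch below reproduces only the hashPair arithmetic from the snippet and checks which keys two records share. The class and method names are hypothetical, and the sketch assumes fewer than 1000 buckets so that the two indexes do not collide in the key.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch mirroring the hashPair computation in the snippet.
class BucketPairSketch {

    // Keys under which a record in bucket `hash` is emitted.
    static Set<Integer> emittedKeys(int hash, int bucketCount) {
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < bucketCount; i++) {
            keys.add(i < hash ? hash * 1000 + i : i * 1000 + hash);
        }
        return keys;
    }

    // Keys that two records, given by their buckets, have in common.
    static Set<Integer> sharedKeys(int hashA, int hashB, int bucketCount) {
        Set<Integer> shared = emittedKeys(hashA, bucketCount);
        shared.retainAll(emittedKeys(hashB, bucketCount));
        return shared;
    }
}
```

Two records in different buckets share exactly one key, so they meet in exactly one reducer group. Records in the same bucket share every key, which is presumably why the snippet also writes the 0/1 tag, letting the reducer tell the two sides of a pair apart.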




Output of Distance MR

• The output has 3 fields: the first transaction ID, second
transaction ID and the distance

6JHQ79UA      JSXNUV9R      5
6JHQ79UA      Y1AWCM5P      89
6JHQ79UA      UFS5ZM0K      172




Nearest Neighbor MR

• Next we need to find the k nearest neighbors of each
data point. We essentially need the neighbors of a data
point sorted by distance.
• Use a technique called secondary sorting. We tag some
extra data onto the key, which forces the keys to be sorted
by the tagged data as the mapper emits its key and value.
• Going back to the output of the previous MR, this is how
the mapper of this MR will emit key, value
key -> (6JHQ79UA, 5)      value -> (JSXNUV9R, 5)
key -> (JSXNUV9R, 5)      value -> (6JHQ79UA, 5)

key -> (6JHQ79UA, 89)     value -> (Y1AWCM5P, 89)
key -> (Y1AWCM5P, 89)     value -> (6JHQ79UA, 89)




Nearest Neighbor MR (contd)

• On the reducer side, when the reducer gets invoked, we
will get a transaction ID as the key and a list of neighboring
transaction ID and distance pairs as the value
• In the reducer, we iterate through the first k values (the
nearest neighbors, since the values arrive sorted by distance),
take the average distance and emit the transaction ID and
average distance as output. We could also use the median.
1IKVOMZE       5
1JI0A0UE       173
1KWBJ4W3       278
...........
XXXX7362       538
• As expected, we find that the outlier we injected into the
dataset has a very large average distance to its
neighbors.
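
The reducer step can be sketched without any Hadoop classes. The example assumes the distances for one transaction ID arrive as a non-empty list already sorted in ascending order, which is what secondary sorting is meant to provide. The class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch of the per-transaction reducer logic.
class NeighborReducerSketch {

    // Average of the first k distances (or of all of them if fewer than k arrive).
    static double averageOfNearest(List<Integer> sortedDistances, int k) {
        int n = Math.min(k, sortedDistances.size());
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += sortedDistances.get(i);
        }
        return sum / n;
    }

    // Median of the first k distances, an alternative that is less sensitive to one far neighbor.
    static double medianOfNearest(List<Integer> sortedDistances, int k) {
        int n = Math.min(k, sortedDistances.size());
        int mid = n / 2;
        return n % 2 == 1
            ? sortedDistances.get(mid)
            : (sortedDistances.get(mid - 1) + sortedDistances.get(mid)) / 2.0;
    }
}
```

Because only the first k values are read, the reducer can stop iterating early and never needs to hold the full neighbor list in memory.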


Secondary Sorting

• Define the reducer partitioner using the base part of the key
(transaction ID), ensuring all values for a key will be routed
to the same reducer

• Define the grouping comparator using the base part of the
key, ensuring all the values for a transaction ID will be
passed in the same reducer invocation

• Sorting is based on both parts of the key, i.e., transaction
ID and distance
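
The three pieces above can be imitated in plain Java to show how they interact. This is a sketch with hypothetical names that simulates the shuffle in memory. In a real Hadoop job the same roles are played by the classes registered through Job.setPartitionerClass, Job.setGroupingComparatorClass and Job.setSortComparatorClass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory simulation of secondary sorting.
class SecondarySortSketch {

    // Composite key: base part (transaction ID) plus the tagged distance.
    static final class Key {
        final String id;
        final int distance;

        Key(String id, int distance) {
            this.id = id;
            this.distance = distance;
        }
    }

    // Sort order uses both parts: transaction ID first, then distance.
    static final Comparator<Key> SORT =
        Comparator.comparing((Key k) -> k.id).thenComparingInt(k -> k.distance);

    // Partitioning uses only the base part, so all keys for an ID reach one reducer.
    static int partition(Key key, int numReducers) {
        return Math.floorMod(key.id.hashCode(), numReducers);
    }

    // Grouping uses only the base part: one group per transaction ID,
    // with its distances in ascending order.
    static Map<String, List<Integer>> group(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Key k : sorted) {
            groups.computeIfAbsent(k.id, id -> new ArrayList<>()).add(k.distance);
        }
        return groups;
    }
}
```

The point of the design is that the framework does the sorting during the shuffle, so the reducer receives neighbors nearest first and never has to sort them itself.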
How to Choose k

• Very high or very low values of k will cause large errors,
a.k.a. the bias-variance trade-off

• Small k -> low bias error -> high variance error

• Large k -> low variance error -> high bias error

• Find the optimum k by experimenting with different values.




Segmentation

• In reality, data might be segmented or clustered first and
the outlier detection process then run on the relevant cluster.

• What is normal in one segment may be an outlier in
another




Fraud or Emerging Normal Behavior

• We have been able to detect the outlier. But how do we
know whether it’s a fraudulent transaction or an emerging
buying pattern?
• Your credit card may have been compromised and
someone is using it. Or you have fallen in love and
decided to shower him or her with expensive, high-ticket
items.
• We can’t really tell the difference, except that once there
are enough data points for this emerging behavior, we won’t
be getting these false positives from our analysis




Thank You



      Q&A
pghosh@yahoo.com
 Big Data Consultant





Secure and Privacy-Preserving Big-Data Processing
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Outlier and fraud detection using Hadoop

  • 1. Outlier and Fraud Detection. Big Data Science Meetup, July 2012, Fremont, CA
  • 2. About Me: Pranab Ghosh
      • 25+ years in the industry
      • Worked with various technologies and platforms
      • Worked for startups, Fortune 500 companies and everything in between
      • Big data consultant for the last few years
      • Currently a consultant at Apple
      • Active blogger
      • Blog site: http://pkghosh.wordpress.com/
      • Owner of several open source projects
      • Project site: https://github.com/pranab
      • Passionate about data and finding patterns in data
  • 3. My Open Source Hadoop Projects
      • Recommendation engine (sifarish) based on content-based and social recommendation algorithms
      • Fraud analytics (beymani) using proximity and distribution model based algorithms. Today’s talk is related to this project.
      • Web click stream analytics (visitante) for descriptive and predictive analytics
  • 4. Outlier Detection
      • Data that do not conform to the normal and expected patterns are outliers
      • Wide range of applications in various domains, including finance, security and intrusion detection in cyber security
      • Criteria for what constitutes an outlier depend on the problem domain
      • Typically involves large amounts of data, which may be unstructured, creating an opportunity to use big data technologies
  • 5. Data Type
      • Instance data, where the outlier detection algorithm operates on individual instances of data, e.g., a particular credit card transaction involving a large amount of money spent on an unusual product
      • Sequence data with a temporal or spatial relationship. The goal of outlier detection is to find unusual sequences, e.g., intrusion detection and cyber security
      • Our focus is on outlier detection for instance data using Hadoop. We will be using credit card transaction data as an example
  • 6. Challenges
      • Defining the normal regions in a data set is the main challenge. The boundary between normal and outlier may not be crisply defined.
      • The definition of normal behavior may evolve with time. What is normal today may be considered anomalous in the future and vice versa.
      • In many cases malicious adversaries adapt to make their operations look normal and try to stay undetected
  • 7. Instance Based Analysis
      • Supervised classification techniques using training data labeled as normal or outlier, e.g., Bayesian filtering, neural networks, support vector machines etc. Not very reliable because labeled outlier data is scarce
      • Multivariate probability distribution based. Data points with low probability are likely to be outliers
      • Proximity based approaches. Distances between data points are calculated in a multidimensional feature space
      • Relative density based. Density is the inverse of the average distance to neighbors.
  • 8. Instance Based Analysis (contd)
      • Shared nearest neighbor based. We consider the number of shared neighbors between neighboring data points.
      • Clustering based. Data points with poor cluster membership are likely outliers.
      • Information theory based. Including an outlier increases the entropy of the data set. We identify data points whose removal causes a large drop in the entropy of the data set
      • … and many more techniques.
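The information theory bullet on slide 8 can be illustrated for a single categorical attribute. This is a minimal sketch, not code from beymani: the class `EntropyOutlier` and its methods are hypothetical names, and it assumes entropy is measured as Shannon entropy over value frequencies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of beymani.
final class EntropyOutlier {
    // Shannon entropy, in bits, of the value frequencies.
    static double entropy(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) total += c;
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) continue;               // a value with zero count contributes nothing
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Entropy before removal minus entropy after removing one occurrence of candidate.
    static double entropyDrop(List<String> values, String candidate) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        if (!counts.containsKey(candidate)) {
            throw new IllegalArgumentException("candidate not in data set: " + candidate);
        }
        double before = entropy(counts);
        counts.merge(candidate, -1, Integer::sum);
        return before - entropy(counts);
    }
}
```

Removing a rare value gives a positive drop, while removing one instance of a dominant value typically gives a small or negative drop, so the rare value ranks as the more likely outlier.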
  • 9. Sequence Based Analysis
      • Keep a list of known sequences corresponding to malicious behavior and detect those in the data. Does not work well for new and unknown threats
      • Markov chain, which considers observable states and the probability of transition between states.
      • Hidden Markov Model, where the system has both hidden and observable states
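The Markov chain bullet on slide 9 can be sketched as follows: estimate transition probabilities from sequences assumed to be normal, then score a new sequence by its average log transition probability. The class `MarkovSequenceScorer`, the add-one smoothing and the state names used with it are assumptions made for illustration; the talk does not specify them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical first-order Markov chain scorer; illustration only.
final class MarkovSequenceScorer {
    private final Map<String, Map<String, Integer>> transitions = new HashMap<>();
    private final Map<String, Integer> outgoing = new HashMap<>();
    private final int stateCount;   // number of distinct observable states

    MarkovSequenceScorer(int stateCount) {
        this.stateCount = stateCount;
    }

    // Count every consecutive (from, to) pair in a training sequence.
    void train(List<String> sequence) {
        for (int i = 0; i + 1 < sequence.size(); i++) {
            String from = sequence.get(i);
            String to = sequence.get(i + 1);
            transitions.computeIfAbsent(from, k -> new HashMap<>()).merge(to, 1, Integer::sum);
            outgoing.merge(from, 1, Integer::sum);
        }
    }

    // Add-one smoothed estimate of P(to | from); smoothing is an assumption of this sketch.
    double probability(String from, String to) {
        int pair = transitions
                .getOrDefault(from, Collections.<String, Integer>emptyMap())
                .getOrDefault(to, 0);
        int total = outgoing.getOrDefault(from, 0);
        return (pair + 1.0) / (total + stateCount);
    }

    // Average log probability per transition; lower values mean a more unusual sequence.
    double averageLogLikelihood(List<String> sequence) {
        int steps = sequence.size() - 1;
        if (steps < 1) return 0.0;
        double sum = 0.0;
        for (int i = 0; i < steps; i++) {
            sum += Math.log(probability(sequence.get(i), sequence.get(i + 1)));
        }
        return sum / steps;
    }
}
```

A sequence made of transitions that were never seen in training scores well below the training sequences, which is the signal a sequence-based detector would threshold on.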
  • 10. Model Based vs Memory Based
      • As you may have observed, with some of the methods we build a model from the training data and apply the model to detect outliers
      • With the other methods, we don’t build a model but use the existing data directly to detect outliers
      • The technique we will discuss today is based on the latter approach, i.e., memory based.
  • 11. Average Distance to k Neighbors
      • We find the distance between each pair of points. This has computational complexity of O(n × n)
      • For any point we find the k nearest neighbors, where k is a user-configured number
      • For each point, we find the average distance to the k nearest neighbors
      • Identify data points with a high average distance to their neighbors. Outliers will have a high average distance to neighbors
      • We can select data points above some threshold average distance or choose the top n based on average distance
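The steps on slide 11 can be sketched on a single machine before moving to MapReduce. This is a minimal in-memory version, assuming numeric feature vectors and Euclidean distance; the class name `KnnOutlierScore` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical in-memory version of the average k-nearest-neighbor distance score.
final class KnnOutlierScore {
    // points[i] is a feature vector; returns, for each point, the average
    // Euclidean distance to its k nearest neighbors. Cost is O(n x n).
    static double[] averageKnnDistance(double[][] points, int k) {
        int n = points.length;
        if (k < 1 || k >= n) throw new IllegalArgumentException("need 1 <= k < n");
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double[] dist = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;            // skip the point itself
                double sum = 0.0;
                for (int d = 0; d < points[i].length; d++) {
                    double diff = points[i][d] - points[j][d];
                    sum += diff * diff;
                }
                dist[idx++] = Math.sqrt(sum);
            }
            Arrays.sort(dist);                   // ascending, nearest first
            double total = 0.0;
            for (int m = 0; m < k; m++) total += dist[m];
            scores[i] = total / k;
        }
        return scores;
    }
}
```

For the one-dimensional points 1, 2, 3 and 100 with k = 2, the scores are 1.5, 1.0, 1.5 and 97.5, so the last point stands out under either a threshold or a top-n selection.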
  • 12. Big Data Ecosystem. Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
  • 13. When to Use Hadoop. Credit: http://www.aaroncordova.com/2012/01/do-i-need-sql-or-hadoop-flowchart.html
  • 14. Map Reduce Data Flow. Credit: Yahoo Developer Network
  • 15. Hadoop at 30,000 ft
      • MapReduce: Parallel processing pattern. Functional programming model. Implemented as a framework, with user-supplied map and reduce code.
      • HDFS: Replicated and partitioned file system. Designed for sequential access. Writes are append only.
      • Data Locality: Code moves to where the data resides and gets executed there.
      • IO Bound: Jobs are typically IO bound (disk and network)
  • 16. Credit Card Transaction
      We have a very simple data model. Each credit card transaction contains the following 4 attributes:
      1. Transaction ID
      2. Time of the day
      3. Money spent
      4. Vendor type
      Here are some examples. The last one is an outlier, injected into the data set.
      YX66AJ9U  1025    20.47  drug store
      98ZCM6B1  1910    55.50  restaurant
      XXXX7362  0100  1875.40  jewellery store
  • 17. Distance Calculation
      • For a numerical attribute (e.g., money amount), the distance is the difference in values
      • For an unranked categorical attribute (e.g., vendor type), the distance is 0 if the values are the same and 1 otherwise. The distance could also be set softly between 0 and 1 (e.g., product color).
      • If the unranked categorical attributes have a hierarchical relationship, the minimum number of edges to traverse from one node to the other could be used as the distance (e.g., vendor type hierarchy)
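The three per-attribute rules on slide 17 can be sketched as below. The class `AttributeDistance` is a hypothetical name, and the child-to-parent map used to represent the vendor hierarchy is an assumption of this sketch (a tree with the root absent from the map), not the representation used in the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical per-attribute distance functions; illustration only.
final class AttributeDistance {
    // Numerical attribute: absolute difference in values.
    static double numeric(double a, double b) {
        return Math.abs(a - b);
    }

    // Unranked categorical attribute: 0 if equal, 1 otherwise.
    static double categorical(String a, String b) {
        return a.equals(b) ? 0.0 : 1.0;
    }

    // Hierarchical categorical attribute: number of edges on the path between
    // two nodes of a tree, given a child -> parent map (assumed to be acyclic).
    static int hierarchical(String a, String b, Map<String, String> parent) {
        List<String> pathA = pathToRoot(a, parent);
        List<String> pathB = pathToRoot(b, parent);
        for (int i = 0; i < pathA.size(); i++) {
            int j = pathB.indexOf(pathA.get(i));   // first common ancestor
            if (j >= 0) return i + j;
        }
        throw new IllegalArgumentException("nodes are not in the same hierarchy");
    }

    private static List<String> pathToRoot(String node, Map<String, String> parent) {
        List<String> path = new ArrayList<>();
        for (String cur = node; cur != null; cur = parent.get(cur)) path.add(cur);
        return path;
    }
}
```

With a hierarchy where "drug store" and "grocery" both sit under "retail", the two are 2 edges apart, while "drug store" and a "restaurant" under a different branch are further apart.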
  • 18. Distance Aggregation
      • We aggregate across all attributes to find the net distance between 2 entities
      • Different ways to aggregate: Euclidean, Manhattan. Attributes can be weighted during aggregation to indicate their relative importance
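The two aggregation options on slide 18 can be written as weighted sums over the per-attribute distances. This is a sketch under assumptions: the class `DistanceAggregator` is hypothetical, the weights multiply each term directly, and the attribute distances are assumed to be scaled to comparable ranges beforehand.

```java
// Hypothetical weighted aggregation of per-attribute distances; illustration only.
final class DistanceAggregator {
    // Weighted Manhattan: sum of w[i] * |d[i]|.
    static double manhattan(double[] distances, double[] weights) {
        checkLengths(distances, weights);
        double sum = 0.0;
        for (int i = 0; i < distances.length; i++) {
            sum += weights[i] * Math.abs(distances[i]);
        }
        return sum;
    }

    // Weighted Euclidean: square root of the sum of w[i] * d[i]^2.
    static double euclidean(double[] distances, double[] weights) {
        checkLengths(distances, weights);
        double sum = 0.0;
        for (int i = 0; i < distances.length; i++) {
            sum += weights[i] * distances[i] * distances[i];
        }
        return Math.sqrt(sum);
    }

    private static void checkLengths(double[] distances, double[] weights) {
        if (distances.length != weights.length) {
            throw new IllegalArgumentException("one weight per attribute is required");
        }
    }
}
```

For attribute distances (3, 4) with unit weights, the Manhattan result is 7 and the Euclidean result is 5. Setting a weight to 0 drops that attribute from the net distance.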
  • 19. Pair Wise Distance Calculation MR
      • It’s an O(n × n) problem. If there are 1 million transactions, we need to perform 1 trillion computations.
      • The work will be divided up between the reducers. If we have a 100-node Hadoop cluster with 10 reducer slots per node, each reducer will perform roughly 1 billion distance calculations.
      • How do we divide up the work? Use partitioned hashing. If h1 = hash(id1) and h2 = hash(id2), we use a function of h1 and h2 as the key of the mapper output. For example, f(h1,h2) = h1 << 10 | h2.
      • All the transactions with IDs hashed to h1 or h2 will end up with the same reducer.
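The arithmetic and the key function on slide 19 can be checked with a short sketch. The class `PairKey` is hypothetical. Two assumptions go beyond the slide: bucket numbers are kept below 1024 so that they fit in the 10 shifted bits, and the larger hash is placed first so that (h1, h2) and (h2, h1) give the same key, which mirrors the ordering in the snippet on slide 20.

```java
// Hypothetical helpers for the pair key and the workload estimate; illustration only.
final class PairKey {
    static final int BITS = 10;   // matches the shift in f(h1,h2) = h1 << 10 | h2

    // Non-negative bucket number for a transaction ID.
    static int bucket(String id, int bucketCount) {
        if (bucketCount < 1 || bucketCount > (1 << BITS)) {
            throw new IllegalArgumentException("bucketCount must be between 1 and 1024");
        }
        return Math.floorMod(id.hashCode(), bucketCount);
    }

    // Same key regardless of argument order: the larger hash goes in the high bits.
    static int pairKey(int h1, int h2) {
        int hi = Math.max(h1, h2);
        int lo = Math.min(h1, h2);
        return (hi << BITS) | lo;
    }

    // n x n comparisons spread evenly over all reducer slots.
    static long comparisonsPerReducer(long records, long nodes, long slotsPerNode) {
        return records * records / (nodes * slotsPerNode);
    }
}
```

With 1,000,000 records, 100 nodes and 10 slots per node, `comparisonsPerReducer` returns 1,000,000,000, which is the figure quoted on the slide.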
  • 20. Partitioned Hashing
      • Code snippet from SameTypeSimilarity.java

        // Optional partition field; "none" when no partition ordinal is configured
        String partition = partitonOrdinal >= 0 ? items[partitonOrdinal] : "none";
        // Map the record ID to a non-negative bucket number below bucketCount
        hash = (items[idOrdinal].hashCode() % bucketCount + bucketCount) / 2;
        // Emit the record once per bucket; the larger bucket number leads the pair key
        for (int i = 0; i < bucketCount; ++i) {
            if (i < hash) {
                hashPair = hash * 1000 + i;
                keyHolder.set(partition, hashPair, 0);
                valueHolder.set("0" + value.toString());
            } else {
                hashPair = i * 1000 + hash;
                keyHolder.set(partition, hashPair, 1);
                valueHolder.set("1" + value.toString());
            }
            context.write(keyHolder, valueHolder);
        }
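The key scheme in the slide 20 snippet can be checked outside Hadoop. The sketch below reproduces only the `hashPair` arithmetic from the snippet and verifies that records from two different buckets meet under exactly one key. The class `BucketPairs` is hypothetical, bucket counts are assumed to stay below 1000 so the keys remain unambiguous, and how the reducer uses the 0/1 tag is not shown on the slides, so it is left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical check of the hashPair key scheme; illustration only.
final class BucketPairs {
    // Keys emitted for one record whose ID falls in bucket `hash`.
    static List<Integer> emittedKeys(int hash, int bucketCount) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < bucketCount; ++i) {
            // Same arithmetic as the snippet: the larger bucket number leads.
            keys.add(i < hash ? hash * 1000 + i : i * 1000 + hash);
        }
        return keys;
    }

    // Keys under which records from bucket hashA and bucket hashB both appear.
    static List<Integer> sharedKeys(int hashA, int hashB, int bucketCount) {
        List<Integer> shared = new ArrayList<>(emittedKeys(hashA, bucketCount));
        shared.retainAll(emittedKeys(hashB, bucketCount));
        return shared;
    }
}
```

For buckets 2 and 5 out of 8, the only shared key is 5002, so that pair of buckets is compared in one place only. Each record is still emitted `bucketCount` times, which is the replication cost of this scheme.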
  • 21. Output of Distance MR
      • The output has 3 fields: the first transaction ID, the second transaction ID and the distance
      6JHQ79UA  JSXNUV9R    5
      6JHQ79UA  Y1AWCM5P   89
      6JHQ79UA  UFS5ZM0K  172
  • 22. Nearest Neighbor MR
      • Next we need to find the k nearest neighbors of each data point. We essentially need the neighbors of a data point sorted by distance.
      • Use a technique called secondary sorting. We append extra data to the key as the mapper emits its key and value, which forces the keys to be sorted by that appended data.
      • Going back to the output of the previous MR, this is how the mapper of this MR will emit key and value
      key -> (6JHQ79UA, 5)   value -> (JSXNUV9R, 5)
      key -> (JSXNUV9R, 5)   value -> (6JHQ79UA, 5)
      key -> (6JHQ79UA, 89)  value -> (Y1AWCM5P, 89)
      key -> (Y1AWCM5P, 89)  value -> (6JHQ79UA, 89)
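The mapper behaviour on slide 22 can be sketched in plain Java: each line of the distance output produces two records, one keyed by each transaction ID, with the distance appended to the key. The class `NeighborMapper` and its nested `Emitted` type are hypothetical, and the sketch assumes whitespace-separated fields with an integer distance, as in the sample output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper of the nearest neighbor job; illustration only.
final class NeighborMapper {
    // One emitted record: composite key (keyId, keyDistance), value (neighborId, keyDistance).
    static final class Emitted {
        final String keyId;
        final int keyDistance;
        final String neighborId;

        Emitted(String keyId, int keyDistance, String neighborId) {
            this.keyId = keyId;
            this.keyDistance = keyDistance;
            this.neighborId = neighborId;
        }
    }

    // Input line format assumed: "<id1> <id2> <distance>".
    static List<Emitted> map(String line) {
        String[] fields = line.trim().split("\\s+");
        int distance = Integer.parseInt(fields[2]);
        List<Emitted> out = new ArrayList<>();
        out.add(new Emitted(fields[0], distance, fields[1]));   // id1 sees id2 as a neighbor
        out.add(new Emitted(fields[1], distance, fields[0]));   // id2 sees id1 as a neighbor
        return out;
    }
}
```

Emitting both directions matters because the previous job writes each pair only once, yet both transactions need the other in their neighbor list.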
  • 23. Nearest Neighbor MR (contd)
      • When the reducer gets invoked, it receives a transaction ID as the key and a list of neighboring transaction ID and distance pairs as the value
      • In the reducer, we iterate through the values, take the average distance and emit the transaction ID and average distance as output. We could also use the median.
      1IKVOMZE, 5
      1JI0A0UE, 173
      1KWBJ4W3, 278
      ...........
      XXXX7362, 538
      • As expected, the outlier we injected into the data set has a very large average distance to its neighbors.
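The reducer step on slide 23 reduces to averaging the first k values, because secondary sorting delivers the neighbor distances in ascending order. This is a minimal sketch with a hypothetical class name `NeighborReducer`; it assumes integer distances and falls back to fewer than k neighbors when that is all the data offers.

```java
// Hypothetical stand-in for the reducer of the nearest neighbor job; illustration only.
final class NeighborReducer {
    // sortedDistances must already be in ascending order (nearest first).
    static double averageOfNearest(Iterable<Integer> sortedDistances, int k) {
        long sum = 0;
        int count = 0;
        for (int distance : sortedDistances) {
            if (count == k) break;    // only the k nearest neighbors count
            sum += distance;
            count++;
        }
        if (count == 0) throw new IllegalArgumentException("no neighbors");
        return (double) sum / count;
    }
}
```

Because the loop stops after k values, the reducer never has to buffer or sort the full neighbor list in memory.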
  • 24. Secondary Sorting
      • Define the reducer partitioner using the base part of the key (transaction ID), ensuring all values for a key are routed to the same reducer
      • Define the grouping comparator using the base part of the key, ensuring all the values for a transaction ID are passed to the same reducer invocation
      • Sorting is based on both parts of the key, i.e., transaction ID and distance.
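The three pieces on slide 24 can be sketched with plain `java.util.Comparator` objects instead of Hadoop classes. The names `SecondarySort`, `Key`, `SORT`, `GROUP` and `partition` are hypothetical. In an actual Hadoop job these roles would be filled by classes registered through `Job.setPartitionerClass`, `Job.setGroupingComparatorClass` and `Job.setSortComparatorClass`.

```java
import java.util.Comparator;

// Hypothetical plain-Java model of secondary sorting; illustration only.
final class SecondarySort {
    // Composite key: base part (transaction ID) plus the appended distance.
    static final class Key {
        final String id;
        final int distance;

        Key(String id, int distance) {
            this.id = id;
            this.distance = distance;
        }
    }

    // Sort order: by transaction ID, then by distance within the same ID.
    static final Comparator<Key> SORT =
            Comparator.comparing((Key k) -> k.id).thenComparingInt(k -> k.distance);

    // Grouping: keys with the same transaction ID belong to one reducer invocation.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.id);

    // Partitioning: only the transaction ID decides which reducer receives the record.
    static int partition(Key key, int numReducers) {
        return Math.floorMod(key.id.hashCode(), numReducers);
    }
}
```

Keeping the distance out of both the partitioner and the grouping comparator is what lets all neighbors of one transaction arrive together while still being ordered by distance.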
  • 25. How to Choose k
      • Very high or very low values of k cause large error (the bias-variance trade-off)
      • Small k -> low bias error -> high variance error
      • Large k -> low variance error -> high bias error
      • Find the optimum k by experimenting with different values.
  • 26. Segmentation
      • In reality, data might be segmented or clustered first, with the outlier detection process then run on the relevant cluster.
      • What is normal in one segment may be an outlier in another
  • 27. Fraud or Emerging Normal Behavior
      • We have been able to detect the outlier. But how do we know whether it’s a fraudulent transaction or an emerging buying pattern?
      • Your credit card may have been compromised and someone is using it. Or you have fallen in love and decided to shower him or her with expensive, big-ticket items.
      • We can’t really tell the difference, except that once there are enough data points for this emerging behavior, we won’t be getting these false positives from our analysis
  • 28. Thank You. Q&A. pghosh@yahoo.com, Big Data Consultant