1. Outlier and Fraud Detection
Big Data Science Meetup
July 2012
Fremont, CA
2. About Me
Pranab Ghosh
• 25+ years in the industry
• Worked with various technologies and platforms
• Worked for startups, Fortune 500 companies, and everything in between
• Big data consultant for the last few years
• Currently a consultant at Apple
• Active Blogger
• Blog site: http://pkghosh.wordpress.com/
• Owner of several open source projects
• Project site: https://github.com/pranab
• Passionate about data and finding patterns in data
3. My Open Source Hadoop Projects
• Recommendation engine (sifarish) based on content-based
and social recommendation algorithms
• Fraud analytics (beymani) using proximity and
distribution-model based algorithms. Today's talk is
related to this project.
• Web clickstream analytics (visitante) for descriptive
and predictive analytics
4. Outlier Detection
• Data that do not conform to normal and expected
patterns are outliers
• Wide range of applications in various domains,
including finance, network security, and intrusion
detection
• Criteria for what constitutes an outlier depend on the
problem domain
• Typically involves large amounts of data, which may be
unstructured, creating an opportunity to use big data
technologies
5. Data Type
• Instance data, where the outlier detection algorithm
operates on an individual instance of data, e.g., a
particular credit card transaction involving a large
amount of money spent on an unusual product
• Sequence data, with temporal or spatial relationships.
The goal of outlier detection is to find unusual
sequences, e.g., in intrusion detection and cyber security
• Our focus is on outlier detection for instance data
using Hadoop. We will use credit card transaction
data as an example
6. Challenges
• Defining the normal regions in a data set is the main
challenge. The boundary between normal and outlier
may not be crisply defined.
• The definition of normal behavior may evolve with time.
What is normal today may be considered anomalous in
the future, and vice versa.
• In many cases malicious adversaries adapt their
operations to look normal and try to stay undetected
7. Instance Based Analysis
• Supervised classification techniques using labeled
training data with normal and outlier examples, e.g.,
Bayesian filtering, neural networks, support vector
machines, etc. Not very reliable because labeled
outlier data is scarce
• Multivariate probability distribution based. Data points
with low probability are likely to be outliers
• Proximity based approaches. Distances between data
points are calculated in a multi-dimensional feature
space
• Relative density based. Density is the inverse of the
average distance to neighbors.
8. Instance Based Analysis (contd)
• Shared nearest neighbor based. We consider the
number of shared neighbors between neighboring data
points.
• Clustering based. Data points with poor cluster
membership are likely outliers.
• Information theory based. Inclusion of outliers
increases the entropy of the data set. We identify data
points whose removal causes a large drop in the entropy
of the data set
• … and many more techniques.
9. Sequence Based Analysis
• Having a list of known sequences corresponding to
malicious behavior and detecting those in the data. Does
not work well for new and unknown threats
• Markov chains, which consider observable states and the
probability of transitions between states
• Hidden Markov Models, where the system has both hidden
and observable states
10. Model Based vs Memory Based
• As you may have observed, with some of the methods
we build a model from the training data and apply the
model to detect outliers
• With the other methods, we don't build a model but use
the existing data directly to detect outliers
• The technique we will discuss today is based on the latter
approach, i.e., memory based.
11. Average Distance to k Neighbors
• We find the distance between each pair of points.
This has computational complexity of O(n x n)
• For any point we find the k nearest neighbors, where k is
a user-configured number
• For each point, we find the average distance to the k
nearest neighbors
• Identify data points with a high average distance to their
neighbors. Outliers will have a high average distance to
their neighbors
• We can select data points above some threshold
average distance, or choose the top n based on
average distance
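The steps above can be sketched on a single machine in Python. This is a minimal brute-force illustration; the point representation (numeric vectors) and the choices of k and n are arbitrary here, and the distributed Hadoop version is what the rest of the talk covers.

```python
# Minimal single-machine sketch of average distance to k neighbors.
import math

def avg_knn_distance(points, k):
    """Return [(index, average distance to k nearest neighbors), ...]."""
    result = []
    for i, p in enumerate(points):
        # O(n) distances from point i to every other point -> O(n x n) overall
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append((i, sum(dists[:k]) / k))
    return result

def top_outliers(points, k, n):
    # points with the largest average distance are outlier candidates
    scores = avg_knn_distance(points, k)
    return sorted(scores, key=lambda s: s[1], reverse=True)[:n]
```

For example, given a tight cluster of points plus one point far away, `top_outliers` with a small k ranks the far point first.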
15. Hadoop at 30,000 ft
• MapReduce – Parallel processing pattern. Functional
programming model. Implemented as a framework, with
user-supplied map and reduce code.
• HDFS – Replicated and partitioned file system. Sequential
access only. Writes are append only.
• Data Locality – Code moves to where the data resides and
gets executed there.
• IO Bound – Typically IO bound (disk and network)
16. Credit Card Transaction
We have a very simple data model. Each credit card
transaction contains the following 4 attributes
1. Transaction ID
2. Time of the day
3. Money spent
4. Vendor type
Here are some examples. The last one is an outlier,
injected into the data set.
YX66AJ9U 1025 20.47 drug store
98ZCM6B1 1910 55.50 restaurant
XXXX7362 0100 1875.40 jewellery store
17. Distance Calculation
• For a numerical attribute (e.g., money amount), distance is
the difference in values
• For an unranked categorical attribute (e.g., vendor type), the
distance is 0 if the values are the same and 1 otherwise. The
distances could also be set softly between 0 and 1 (e.g.,
product color).
• If the unranked categorical attributes have a hierarchical
relationship, the minimum number of edges to traverse from
one node to the other could be used as the distance (e.g.,
vendor type hierarchy)
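The three distance rules above can be sketched as follows. This is an illustrative sketch, not code from beymani; the hierarchy is passed in as a child-to-parent map, and the vendor taxonomy used to exercise it is made up.

```python
# Per-attribute distances for the three attribute kinds described above.

def numeric_distance(a, b):
    # numerical attribute: difference in values
    return abs(a - b)

def categorical_distance(a, b):
    # unranked categorical attribute: 0 if same, 1 otherwise
    return 0 if a == b else 1

def hierarchical_distance(a, b, parent):
    """Minimum number of edges between two nodes in a tree,
    given a child -> parent map."""
    def ancestors(x):
        path = [x]
        while x in parent:
            x = parent[x]
            path.append(x)
        return path
    pa, pb = ancestors(a), ancestors(b)
    # edges to the lowest common ancestor from each side
    common = next(x for x in pa if x in pb)
    return pa.index(common) + pb.index(common)
```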
18. Distance Aggregation
• We aggregate across all attributes to find the net
distance between two entities
• Different ways to aggregate: Euclidean, Manhattan.
Attributes can be weighted during aggregation, indicative
of their relative importance
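The two aggregation schemes can be sketched as below; the weight values in any call are illustrative importance factors chosen by the user, not anything prescribed by the talk.

```python
# Aggregate per-attribute distances into one net distance.
# Weights express relative attribute importance; default is uniform.

def euclidean(dists, weights=None):
    weights = weights or [1.0] * len(dists)
    return sum(w * d * d for w, d in zip(weights, dists)) ** 0.5

def manhattan(dists, weights=None):
    weights = weights or [1.0] * len(dists)
    return sum(w * abs(d) for w, d in zip(weights, dists))
```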
19. Pair Wise Distance Calculation MR
• It's an O(n x n) problem. If there are 1 million transactions,
we need to perform 1 trillion computations.
• The work will be divided up between the reducers. If we
have a 100-node Hadoop cluster with 10 reducer slots
per node, each reducer will perform roughly 1 billion
distance calculations.
• How do we divide up the work? Use partitioned hashing.
If h1 = hash(id1) and h2 = hash(id2), we use a function of h1
and h2 as the key of the mapper output. For example,
f(h1, h2) = h1 << 10 | h2.
• All transactions whose IDs hash to h1 or h2 will end up
at the same reducer.
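The key function f(h1, h2) = h1 << 10 | h2 can be sketched as follows. The `bucket` hash here is a stand-in, not the hash beymani uses, and 32 buckets is an arbitrary choice that fits comfortably in the 10-bit shift; ordering the two hashes ensures the pair (a, b) and the pair (b, a) land on the same reducer.

```python
# Partitioned hashing: map a pair of transaction IDs to one reducer key.

def bucket(txn_id, num_buckets=32):
    # stand-in stable hash of a transaction ID into [0, num_buckets)
    return sum(ord(c) for c in txn_id) % num_buckets

def pair_key(id1, id2, num_buckets=32):
    h1, h2 = bucket(id1, num_buckets), bucket(id2, num_buckets)
    if h1 > h2:
        h1, h2 = h2, h1  # order so (a, b) and (b, a) get the same key
    return h1 << 10 | h2  # f(h1, h2) = h1 << 10 | h2
```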
21. Output of Distance MR
• The output has 3 fields: the first transaction ID, the second
transaction ID, and the distance
6JHQ79UA JSXNUV9R 5
6JHQ79UA Y1AWCM5P 89
6JHQ79UA UFS5ZM0K 172
22. Nearest Neighbor MR
• Next we need to find the k nearest neighbors of each
data point. We essentially need the neighbors of a data
point sorted by distance.
• We use a technique called secondary sorting. We tag some
extra data onto the key, which forces the keys to be sorted
by the tagged data as the mapper emits its key and value.
• Going back to the output of the previous MR, this is how
the mapper of this MR will emit key, value
key -> (6JHQ79UA, 5) value -> (JSXNUV9R, 5)
key -> (JSXNUV9R, 5) value -> (6JHQ79UA, 5)
key -> (6JHQ79UA, 89) value -> (Y1AWCM5P, 89)
key -> (Y1AWCM5P, 89) value -> (6JHQ79UA, 89)
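The mapper emission above can be sketched as a plain function: each pairwise-distance record yields two (composite key, value) pairs, one for each endpoint, so every transaction eventually sees all of its neighbors. This is an illustrative sketch, not beymani's mapper.

```python
# Mapper sketch for the nearest-neighbor MR: one input record from the
# distance MR produces two records, tagging the distance onto the key
# to drive secondary sorting.

def nn_map(record):
    id1, id2, dist = record
    return [
        ((id1, dist), (id2, dist)),
        ((id2, dist), (id1, dist)),
    ]
```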
23. Nearest Neighbor MR (contd)
• On the reducer side, when the reducer gets invoked, we
get a transaction ID as the key and a list of (neighboring
transaction ID, distance) pairs as the values
• In the reducer, we iterate through the values, take the
average distance, and emit the transaction ID and average
distance as output. We could also use the median.
1IKVOMZE 5
1JI0A0UE 173
1KWBJ4W3 278
...........
XXXX7362 538
• As expected, we find that the outlier we injected into the
dataset has a very large average distance to its
neighbors.
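The reducer side can be sketched as below. It assumes the neighbor list arrives already sorted by distance, which is exactly what secondary sorting guarantees; the distances in the test are made-up values, not data from the talk.

```python
# Reducer sketch for the nearest-neighbor MR: average the k nearest
# distances for one transaction ID.

def nn_reduce(txn_id, sorted_neighbors, k):
    """sorted_neighbors: [(neighbor_id, distance), ...] ascending by
    distance, courtesy of secondary sorting."""
    nearest = sorted_neighbors[:k]
    avg = sum(d for _, d in nearest) / len(nearest)
    return txn_id, avg
```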
24. Secondary Sorting
• Define the reducer partitioner using the base part of the key
(transaction ID), ensuring all values for a key will be routed
to the same reducer
• Define the grouping comparator using the base part of the
key, ensuring all the values for a transaction ID will be
passed in the same reducer invocation
• Sorting is based on both parts of the key, i.e., transaction
ID and distance
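In Hadoop these three pieces are a Java Partitioner, a grouping comparator, and a sort comparator; the single-process Python simulation below is only meant to show how the three interact. The partition count of 4 is arbitrary.

```python
# Simulation of Hadoop secondary sorting: partition by the base key,
# sort by the full composite key, group by the base key.
from itertools import groupby

def shuffle(pairs, num_partitions=4):
    """pairs: list of ((txn_id, dist), value). Yields
    (partition, txn_id, values sorted by distance)."""
    def partition(txn_id):
        # partitioner: base part of the key only
        return hash(txn_id) % num_partitions
    # sort comparator: both parts of the composite key
    ordered = sorted(pairs, key=lambda kv: (kv[0][0], kv[0][1]))
    # grouping comparator: base key only -> one reducer call per txn_id
    for txn_id, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield partition(txn_id), txn_id, [v for _, v in group]
```

Because sorting uses the distance but grouping ignores it, each reducer invocation receives one transaction ID with its neighbors already ordered nearest-first.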
25. How to Choose k
• Values of k that are too high or too low will cause large
error, a.k.a. the bias-variance trade-off
• Small k -> low bias error -> high variance error
• Large k -> low variance error -> high bias error
• Find the optimum k by experimenting with different values.
26. Segmentation
• In reality, the data might be segmented or clustered first,
and then the outlier detection process run on the relevant
segment.
• What is normal in one segment may be an outlier in
another
27. Fraud or Emerging Normal Behavior
• We have been able to detect the outlier. But how do we
know whether it's a fraudulent transaction or an emerging
buying pattern?
• Your credit card may have been compromised and
someone else is using it. Or you may have fallen in love
and decided to shower him or her with expensive,
big-ticket items.
• We can't really tell the difference, except that once there
are enough data points for the emerging behavior, we won't
be getting these false positives from our analysis
28. Thank You
Q&A
pghosh@yahoo.com
Big Data Consultant