
Outlier and fraud detection using Hadoop


Using Hadoop for outlier and fraud detection with proximity-based techniques


  1. Outlier and Fraud Detection. Big Data Science Meetup, July 2012, Fremont, CA
  2. About Me: Pranab Ghosh
• 25+ years in the industry
• Worked with various technologies and platforms
• Worked for startups, Fortune 500 companies and everything in between
• Big data consultant for the last few years
• Currently a consultant at Apple
• Active blogger
• Owner of several open source projects
• Passionate about data and finding patterns in data
  3. My Open Source Hadoop Projects
• Recommendation engine (sifarish) based on content-based and social recommendation algorithms
• Fraud analytics (beymani) using proximity and distribution model based algorithms. Today's talk is related to this project.
• Web click stream analytics (visitante) for descriptive and predictive analytics
  4. Outlier Detection
• Data that do not conform to normal and expected patterns are outliers
• Wide range of applications in various domains, including finance, security and intrusion detection in cyber security
• The criteria for what constitutes an outlier depend on the problem domain
• Typically involves large amounts of data, which may be unstructured, creating an opportunity to use big data technologies
  5. Data Type
• Instance data, where the outlier detection algorithm operates on individual data instances, e.g., a particular credit card transaction involving a large amount of money spent on an unusual product
• Sequence data with temporal or spatial relationships, where the goal of outlier detection is to find unusual sequences, e.g., in intrusion detection and cyber security
• Our focus is on outlier detection for instance data using Hadoop. We will use credit card transaction data as an example
  6. Challenges
• Defining the normal regions in a data set is the main challenge. The boundary between normal and outlier may not be crisply defined.
• The definition of normal behavior may evolve with time. What is normal today may be considered anomalous in the future, and vice versa.
• In many cases malicious adversaries adapt to make their operations look normal and try to stay undetected
  7. Instance Based Analysis
• Supervised classification techniques using labeled training data with normal and outlier data, e.g., Bayesian filtering, neural networks, support vector machines. Not very reliable because of the lack of labeled outlier data
• Multivariate probability distribution based. Data points with low probability are likely to be outliers
• Proximity based approaches. Distances between data points are calculated in a multi-dimensional feature space
• Relative density based. Density is the inverse of the average distance to neighbors
  8. Instance Based Analysis (contd)
• Shared nearest neighbor based. We consider the number of shared neighbors between neighboring data points.
• Clustering based. Data points with poor cluster membership are likely outliers.
• Information theory based. Inclusion of outliers increases the entropy of the data set. We identify data points whose removal causes a large drop in entropy.
• ... and many more techniques
  9. Sequence Based Analysis
• Maintain a list of known sequences corresponding to malicious behavior and detect those in the data. Does not work well for new and unknown threats
• Markov chains, which consider observable states and the probability of transition between states
• Hidden Markov Models, where the system has both hidden and observable states
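The Markov chain idea above can be sketched in a few lines of plain Python (an illustrative sketch, not code from the talk; the toy state sequences and the probability floor for unseen transitions are assumptions): learn transition probabilities from normal sequences, then score a new sequence by its average negative log-likelihood, so unfamiliar transitions drive the score up.

```python
from collections import defaultdict
import math

def train_transitions(sequences):
    """Estimate transition probabilities P(next | current) from normal sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nexts in counts.items():
        total = sum(nexts.values())
        probs[a] = {b: c / total for b, c in nexts.items()}
    return probs

def sequence_score(seq, probs, floor=1e-6):
    """Average negative log-likelihood; higher means more anomalous."""
    nll = 0.0
    for a, b in zip(seq, seq[1:]):
        p = probs.get(a, {}).get(b, floor)  # floor for transitions never seen in training
        nll -= math.log(p)
    return nll / max(len(seq) - 1, 1)

probs = train_transitions([list("ABABAB"), list("ABABAB")])
print(sequence_score(list("ABAB"), probs))  # matches learned transitions: score 0
print(sequence_score(list("AABB"), probs))  # A->A and B->B were never observed: high score
```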
  10. Model Based vs Memory Based
• As you may have observed, with some of these methods we build a model from the training data and apply the model to detect outliers
• With the other methods, we don't build a model but use the existing data directly to detect outliers
• The technique we will discuss today is based on the latter approach, i.e., memory based
  11. Average Distance to k Neighbors
• We find the distance between each pair of points. This has computational complexity of O(n x n)
• For each point we find the k nearest neighbors, where k is a user-configured number
• For each point, we find the average distance to the k nearest neighbors
• Identify data points with a high average distance to their neighbors. Outliers will have a high average distance to neighbors
• We can select data points above some threshold average distance, or choose the top n by average distance
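The steps above can be sketched without Hadoop as a brute-force O(n x n) loop (an illustrative sketch; the one-dimensional toy data and the injected outlier value are assumptions, standing in for the credit card records used later):

```python
def knn_outlier_scores(points, k, dist):
    """Average distance to the k nearest neighbors of each point (brute force, O(n^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [1.0, 1.1, 0.9, 1.2, 10.0]  # the last value is an injected outlier
scores = knn_outlier_scores(points, k=2, dist=lambda a, b: abs(a - b))
print(max(range(len(points)), key=lambda i: scores[i]))  # index 4, the outlier
```

The same scoring works unchanged for transaction records once `dist` is the aggregated multi-attribute distance described on the next slides.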
  12. Big Data Ecosystem
  13. When to Use Hadoop
  14. Map Reduce Data Flow (credit: Yahoo Developer Network)
  15. Hadoop at 30000 ft
• MapReduce: parallel processing pattern, functional programming model. Implemented as a framework, with user-supplied map and reduce code.
• HDFS: replicated and partitioned file system. Sequential access only. Writes are append only.
• Data locality: code moves to where the data resides and gets executed there.
• IO bound: typically IO bound (disk and network)
  16. Credit Card Transaction
We have a very simple data model. Each credit card transaction contains the following 4 attributes:
1. Transaction ID
2. Time of the day
3. Money spent
4. Vendor type
Here are some examples. The last one is an outlier, injected into the data set.
YX66AJ9U 1025 20.47 drug store
98ZCM6B1 1910 55.50 restaurant
XXXX7362 0100 1875.40 jewellery store
  17. Distance Calculation
• For a numerical attribute (e.g., money amount), the distance is the difference in values
• For an unranked categorical attribute (e.g., vendor type), the distance is 0 if the values are the same and 1 otherwise. The distances could be set softly between 0 and 1 (e.g., product color).
• If the unranked categorical attributes have a hierarchical relationship, the minimum number of edges to traverse from one node to the other could be used as the distance (e.g., a vendor type hierarchy)
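The first two rules can be sketched as one small function (an illustrative sketch; the `value_range` normalization parameter is an assumption, added so numeric distances can be scaled to [0, 1] before mixing with categorical ones):

```python
def attribute_distance(a, b, kind, value_range=None):
    """Distance between two values of a single attribute.
    kind: 'numeric' -> absolute difference (optionally normalized by value_range),
          'categorical' -> 0 if equal, else 1."""
    if kind == "numeric":
        d = abs(a - b)
        return d / value_range if value_range else d
    return 0.0 if a == b else 1.0

print(attribute_distance(20.47, 55.50, "numeric"))                    # difference in amounts
print(attribute_distance("drug store", "restaurant", "categorical"))  # different vendor types
```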
  18. Distance Aggregation
• We aggregate across all attributes to find the net distance between two entities
• Different ways to aggregate: Euclidean, Manhattan. Attributes can be weighted during aggregation, indicative of their relative importance
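A minimal sketch of the aggregation step (illustrative; the choice to apply weights to the squared terms in the Euclidean case is an assumption, one of several reasonable weighting conventions):

```python
import math

def aggregate(distances, weights=None, metric="euclidean"):
    """Combine per-attribute distances into one net distance.
    Weights express relative attribute importance."""
    w = weights or [1.0] * len(distances)
    if metric == "manhattan":
        return sum(wi * d for wi, d in zip(w, distances))
    # Euclidean: weights applied to the squared per-attribute distances
    return math.sqrt(sum(wi * d * d for wi, d in zip(w, distances)))

# per-attribute distances for, say, time-of-day, amount, vendor type (normalized)
print(aggregate([0.2, 0.9, 1.0]))
print(aggregate([0.2, 0.9, 1.0], metric="manhattan"))
print(aggregate([0.2, 0.9, 1.0], weights=[1.0, 2.0, 0.5]))  # amount weighted up
```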
  19. Pair Wise Distance Calculation MR
• It's an O(n x n) problem. If there are 1 million transactions, we need to perform 1 trillion computations.
• The work will be divided up among the reducers. If we have a 100 node Hadoop cluster with 10 reducer slots per node, each reducer will perform roughly 1 billion distance calculations.
• How do we divide up the work? Use partitioned hashing. If h1 = hash(id1) and h2 = hash(id2), we use a function of h1 and h2 as the key of the mapper output, e.g., f(h1, h2) = h1 << 10 | h2.
• All the transactions with IDs hashed to h1 or h2 will end up in the same reducer.
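The key function can be sketched as follows (an illustrative sketch, not the project's code: the bucket count and the multiply-and-add key packing stand in for the `h1 << 10 | h2` form on the slide, and ordering the two bucket hashes plays the role the halved hash range plays in the real implementation):

```python
BUCKETS = 1000  # assumed bucket count; the real value is a tuning parameter

def bucket(tid):
    """Hash a transaction ID into a small bucket range.
    (Python's string hash is per-process; Hadoop would use a stable hash.)"""
    return hash(tid) % BUCKETS

def pair_key(id1, id2):
    """Reducer key for a transaction pair. Ordering the two bucket hashes
    makes the key symmetric: (a, b) and (b, a) go to the same reducer."""
    lo, hi = sorted((bucket(id1), bucket(id2)))
    return hi * BUCKETS + lo  # packs two bucket hashes into one key

# both orderings of a pair map to the same reducer key
print(pair_key("6JHQ79UA", "JSXNUV9R") == pair_key("JSXNUV9R", "6JHQ79UA"))  # True
```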
  20. Partitioned Hashing
• Code snippet from SameTypeSimilarity.java:
String partition = partitionOrdinal >= 0 ? items[partitionOrdinal] : "none";
hash = (items[idOrdinal].hashCode() % bucketCount + bucketCount) / 2;
for (int i = 0; i < bucketCount; ++i) {
    if (i < hash) {
        hashPair = hash * 1000 + i;
        keyHolder.set(partition, hashPair, 0);
        valueHolder.set("0" + value.toString());
    } else {
        hashPair = i * 1000 + hash;
        keyHolder.set(partition, hashPair, 1);
        valueHolder.set("1" + value.toString());
    }
    context.write(keyHolder, valueHolder);
}
  21. Output of Distance MR
• The output has 3 fields: the first transaction ID, the second transaction ID and the distance
6JHQ79UA JSXNUV9R 5
6JHQ79UA Y1AWCM5P 89
6JHQ79UA UFS5ZM0K 172
  22. Nearest Neighbor MR
• Next we need to find the k nearest neighbors of each data point. We essentially need the neighbors of a data point sorted by distance.
• We use a technique called secondary sorting. We tag some extra data onto the key, which forces the keys to be sorted by the tagged data as the mapper emits its key and value.
• Going back to the output of the previous MR, this is how the mapper of this MR will emit key and value:
key -> (6JHQ79UA, 5) value -> (JSXNUV9R, 5)
key -> (JSXNUV9R, 5) value -> (6JHQ79UA, 5)
key -> (6JHQ79UA, 89) value -> (Y1AWCM5P, 89)
key -> (Y1AWCM5P, 89) value -> (6JHQ79UA, 89)
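The emit-and-sort behavior can be simulated in plain Python (an illustrative sketch of the shuffle, not Hadoop code; the example distances are the ones from the previous slide):

```python
# raw output of the pairwise-distance job: (id1, id2, distance)
pairs = [("6JHQ79UA", "JSXNUV9R", 5),
         ("6JHQ79UA", "Y1AWCM5P", 89),
         ("6JHQ79UA", "UFS5ZM0K", 172)]

# mapper: emit each pair twice, once per endpoint, tagging the distance
# onto the key so the shuffle sorts each point's neighbors nearest-first
emitted = []
for id1, id2, d in pairs:
    emitted.append(((id1, d), (id2, d)))
    emitted.append(((id2, d), (id1, d)))

# shuffle phase: sort by the composite key (transaction ID, then distance)
emitted.sort(key=lambda kv: kv[0])

# the values for 6JHQ79UA now arrive in ascending distance order
neighbors = [v for k, v in emitted if k[0] == "6JHQ79UA"]
print(neighbors)  # nearest neighbor (JSXNUV9R, 5) first
```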
  23. Nearest Neighbor MR (contd)
• On the reducer side, when the reducer gets invoked we get a transaction ID as the key and a list of (neighboring transaction ID, distance) pairs as the value
• In the reducer, we iterate through the values, take the average distance and emit the transaction ID and average distance as output. We could also use the median.
1IKVOMZE 5
1JI0A0UE 173
1KWBJ4W3 278
...
XXXX7362 538
• As expected, we find that the outlier we injected into the data set has a very large average distance to its neighbors.
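The reducer logic can be sketched as a group-and-average over the sorted stream (an illustrative sketch; the individual neighbor distances shown here are made up, since the slide only shows the final averages):

```python
from itertools import groupby

# reducer input: (transaction ID, distance-to-neighbor) records, grouped by ID
# and arriving nearest-first thanks to the secondary sort
records = [("1IKVOMZE", 5), ("1IKVOMZE", 173), ("1IKVOMZE", 278),
           ("XXXX7362", 520), ("XXXX7362", 538), ("XXXX7362", 560)]

def average_neighbor_distance(records, k):
    """Average distance to the k nearest neighbors for each transaction ID."""
    scores = {}
    for tid, group in groupby(records, key=lambda r: r[0]):
        dists = [d for _, d in group][:k]  # first k values are the k nearest
        scores[tid] = sum(dists) / len(dists)
    return scores

scores = average_neighbor_distance(records, k=3)
print(scores)  # the injected outlier XXXX7362 gets a much larger score
```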
  24. Secondary Sorting
• Define the reducer partitioner using the base part of the key (transaction ID), ensuring all values for a key will be routed to the same reducer
• Define the grouping comparator using the base part of the key, ensuring all the values for a transaction ID will be passed in the same reducer invocation
• Sorting is based on both parts of the key, i.e., transaction ID and distance.
  25. How to Choose k
• Values of k that are too high or too low will cause large error, a.k.a. the bias-variance trade-off
• Small k -> low bias error -> high variance error
• Large k -> low variance error -> high bias error
• Find the optimum k by experimenting with different values.
  26. Segmentation
• In reality, the data might be segmented or clustered first, and the outlier detection process then run on the relevant cluster.
• What is normal in one segment may be an outlier in another
  27. Fraud or Emerging Normal Behavior
• We have been able to detect the outlier. But how do we know whether it's a fraudulent transaction or an emerging buying pattern?
• Your credit card may have been compromised and someone is using it. Or you have fallen in love and decided to shower him or her with expensive big-ticket items.
• We can't really tell the difference, except that once there are enough data points for the emerging behavior, we won't be getting these false positives from our analysis
  28. Thank You. Q&A. Big Data Consultant