Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Real Time
Fraud Detection
Patterns and reference architectures
Ted Malaska // PSA Gwen Shapira // Software
Engineer
2
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• ...
3©2014 Cloudera, Inc. All rights reserved.
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoo...
4
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
– HDFS, MapReduce, Yarn, HBase, Spark, Avro,
– Ki...
5
The Problem
©2014 Cloudera, Inc. All rights reserved.
6
Credit Card Transaction Fraud
©2014 Cloudera, Inc. All rights reserved.
7
Ikea Meat Balls
©2014 Cloudera, Inc. All rights reserved.
8
Coupon Fraud
©2014 Cloudera, Inc. All rights reserved.
9
Video Game Strategy
©2014 Cloudera, Inc. All rights reserved.
10
Health Insurance Fraud
©2014 Cloudera, Inc. All rights reserved.
11
• Typical Atomic Card Fraud Detection
• Ikea Meat Ball
• Multi Coupons Combinations
• OP or Negative Video Games Strate...
12
How do we React
• Human Brain at Tennis
– Muscle Memory
– Reaction Thought
– Reflective Meditation
©2014 Cloudera, Inc....
13
Overview of
Key Technologies
©2014 Cloudera, Inc. All rights reserved.
14
Kafka
©2014 Cloudera, Inc. All Rights Reserved.
15©2014 Cloudera, Inc. All rights reserved.
• Messages are organized into topics
• Producers push messages
• Consumers pul...
16©2014 Cloudera, Inc. All rights reserved.
Topics, Partitions and Logs
17©2014 Cloudera, Inc. All rights reserved.
Each partition is a log
18©2014 Cloudera, Inc. All rights reserved.
Each Broker has many partitions
Partition 0 Partition 0
Partition 1 Partition ...
19©2014 Cloudera, Inc. All rights reserved.
Producers load balance between partitions
Partition 0
Partition 1
Partition 2
...
20©2014 Cloudera, Inc. All rights reserved.
Producers load balance between partitions
Partition 0
Partition 1
Partition 2
...
21©2014 Cloudera, Inc. All rights reserved.
Consumers
Consumer Group Y
Consumer Group X
Consumer
Kafka Cluster
Topic
Parti...
22
Flume
23
Sources Interceptors Selectors Channels Sinks
Flume Agent
Short Intro to Flume
Twitter, logs, JMS,
webserver, Kafka
Mas...
24
Flume and/or Kafka
©2014 Cloudera, Inc. All rights reserved.
Flume
UpStream
Flume Source
Interceptor
Flume Channel
Flum...
25
Interceptors
• Mask fields
• Validate information
against external source
• Extract fields
• Modify data format
• Filte...
26
SparkStreaming
27
Spark Streaming Example
©2014 Cloudera, Inc. All rights reserved.
1. val conf = new SparkConf().setMaster("local[2]”)
2...
28
Spark Streaming Example
©2014 Cloudera, Inc. All rights reserved.
1. val conf = new SparkConf().setMaster("local[2]”)
2...
29
DStream
DStream
DStream
Spark Streaming
Confidentiality Information Goes Here
Single Pass
Source Receiver RDD
Source Re...
30
DStream
DStream
DStreamSpark Streaming
Confidentiality Information Goes Here
Single Pass
Source Receiver RDD
Source Rec...
31
Spark Streaming and HBase
©2014 Cloudera, Inc. All rights reserved.
Driver
Walker Node
Configs
Executor
Static Space
Co...
32
High Level
Architecture
©2014 Cloudera, Inc. All rights reserved.
33
Real-Time Event Processing Approach
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR...
34
NRT Processing
©2014 Cloudera, Inc. All rights reserved.
35
Focus on NRT First
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I...
36
Streaming Architecture – NRT Event Processing
©2014 Cloudera, Inc. All rights reserved.
Flume Source
Flume Source
Kafka...
37
Partitioned NRT Event Processing
©2014 Cloudera, Inc. All rights reserved.
Flume Source
Flume Source
Kafka
Initial Even...
38
Completing the
Puzzle
©2014 Cloudera, Inc. All rights reserved.
39
Micro Batching
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
Cli...
40
Complex Topologies
©2014 Cloudera, Inc. All rights reserved.
Kafka
Initial Events Topic
Spark Streaming
KafkaDirect
Con...
41
MicroBatch Bad-Input Handling
©2014 Cloudera, Inc. All rights reserved.
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
Kafka – inc...
42
Ingestion
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
ClientCl...
43
Ingestion
©2014 Cloudera, Inc. All rights reserved.
Flume HDFS Sink
Kafka Cluster
Topic
Partition A
Partition B
Partiti...
44
Reflective Thoughts
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster ...
©2014 Cloudera, Inc. All rights reserved.
Upcoming SlideShare
Loading in …5
×

Fraud Detection Architecture

8,089 views

Published on

This session will go into best practices and detail on how to architect a near real-time application on Hadoop using an end-to-end fraud detection case study as an example. It will discuss various options available for ingest, schema design, processing frameworks, storage handlers and others, available for architecting this fraud detection application and walk through each of the architectural decisions among those choices.

Published in: Data & Analytics
  • Be the first to comment

Fraud Detection Architecture

  1. 1. Real Time Fraud Detection Patterns and reference architectures Ted Malaska // PSA Gwen Shapira // Software Engineer
  2. 2. 2 • Intro • Review Problem • Quick overview of key technology • High level architecture • Deep Dive into NRT Processing • Completing the Puzzle – Micro-batch, Ingest and Batch Overview ©2014 Cloudera, Inc. All rights reserved.
  3. 3. 3©2014 Cloudera, Inc. All rights reserved. • 15 years of moving data • Formerly consultant • Now Cloudera Engineer: – Sqoop Committer – Kafka – Flume • @gwenshap Gwen Shapira
  4. 4. 4 • Ted Malaska (PSA at Cloudera) • Hadoop for ~5 years • Contributed to – HDFS, MapReduce, Yarn, HBase, Spark, Avro, – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafke, Sqoop, Accumulo – And working on a Sentry Patch • Co-Author to O’Reilly Hadoop Application Architectures • Worked with about 70 companies in 8 countries • Marvel Fan Boy • Runner Hello ©2014 Cloudera, Inc. All rights reserved.
  5. 5. 5 The Problem ©2014 Cloudera, Inc. All rights reserved.
  6. 6. 6 Credit Card Transaction Fraud ©2014 Cloudera, Inc. All rights reserved.
  7. 7. 7 Ikea Meat Balls ©2014 Cloudera, Inc. All rights reserved.
  8. 8. 8 Coupon Fraud ©2014 Cloudera, Inc. All rights reserved.
  9. 9. 9 Video Game Strategy ©2014 Cloudera, Inc. All rights reserved.
  10. 10. 10 Health Insurance Fraud ©2014 Cloudera, Inc. All rights reserved.
  11. 11. 11 • Typical Atomic Card Fraud Detection • Ikea Meat Ball • Multi Coupons Combinations • OP or Negative Video Games Strategies • Ad Serving • Health Insurance Fraud • Kid Coming Home From School Review of the Problem ©2014 Cloudera, Inc. All rights reserved.
  12. 12. 12 How do we React • Human Brain at Tennis – Muscle Memory – Reaction Thought – Reflective Meditation ©2014 Cloudera, Inc. All rights reserved.
  13. 13. 13 Overview of Key Technologies ©2014 Cloudera, Inc. All rights reserved.
  14. 14. 14 Kafka ©2014 Cloudera, Inc. All Rights Reserved.
  15. 15. 15©2014 Cloudera, Inc. All rights reserved. • Messages are organized into topics • Producers push messages • Consumers pull messages • Kafka runs in a cluster. Nodes are called brokers The Basics
  16. 16. 16©2014 Cloudera, Inc. All rights reserved. Topics, Partitions and Logs
  17. 17. 17©2014 Cloudera, Inc. All rights reserved. Each partition is a log
  18. 18. 18©2014 Cloudera, Inc. All rights reserved. Each Broker has many partitions Partition 0 Partition 0 Partition 1 Partition 1 Partition 2 Partition 1 Partition 0 Partition 2 Partion 2
  19. 19. 19©2014 Cloudera, Inc. All rights reserved. Producers load balance between partitions Partition 0 Partition 1 Partition 2 Partition 1 Partition 0 Partition 2 Partition 0 Partition 1 Partion 2 Client
  20. 20. 20©2014 Cloudera, Inc. All rights reserved. Producers load balance between partitions Partition 0 Partition 1 Partition 2 Partition 1 Partition 0 Partition 2 Partition 0 Partition 1 Partion 2 Client
  21. 21. 21©2014 Cloudera, Inc. All rights reserved. Consumers Consumer Group Y Consumer Group X Consumer Kafka Cluster Topic Partition A (File) Partition B (File) Partition C (File) Consumer Consumer Consumer Order retained with in partition Order retained with in partition but not over partitionsOffSetX OffSetX OffSetX OffSetYOffSetYOffSetY Off sets are kept per consumer group
  22. 22. 22 Flume
  23. 23. 23 Sources Interceptors Selectors Channels Sinks Flume Agent Short Intro to Flume Twitter, logs, JMS, webserver, Kafka Mask, re-format, validate… DR, critical Memory, file, Kafka HDFS, HBase, Solr
  24. 24. 24 Flume and/or Kafka ©2014 Cloudera, Inc. All rights reserved. Flume UpStream Flume Source Interceptor Flume Channel Flume Sink Down Stream Selector Can Be KafkaCan Be KafkaCan Be Kafka
  25. 25. 25 Interceptors • Mask fields • Validate information against external source • Extract fields • Modify data format • Filter or split events ©2014 Cloudera, Inc. All rights reserved.
  26. 26. 26 SparkStreaming
  27. 27. 27 Spark Streaming Example ©2014 Cloudera, Inc. All rights reserved. 1. val conf = new SparkConf().setMaster("local[2]”) 2. val ssc = new StreamingContext(conf, Seconds(1)) 3. val lines = ssc.socketTextStream("localhost", 9999) 4. val words = lines.flatMap(_.split(" ")) 5. val pairs = words.map(word => (word, 1)) 6. val wordCounts = pairs.reduceByKey(_ + _) 7. wordCounts.print() 8. SSC.start()
  28. 28. 28 Spark Streaming Example ©2014 Cloudera, Inc. All rights reserved. 1. val conf = new SparkConf().setMaster("local[2]”) 2. val sc = new SparkContext(conf) 3. val lines = sc.textFile(path, 2) 4. val words = lines.flatMap(_.split(" ")) 5. val pairs = words.map(word => (word, 1)) 6. val wordCounts = pairs.reduceByKey(_ + _) 7. wordCounts.print()
  29. 29. 29 DStream DStream DStream Spark Streaming Confidentiality Information Goes Here Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Print Pre-first Batch First Batch Second Batch
  30. 30. 30 DStream DStream DStreamSpark Streaming Confidentiality Information Goes Here Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Pre-first Batch First Batch Second Batch Stateful RDD 1 Print Stateful RDD 2 Stateful RDD 1
  31. 31. 31 Spark Streaming and HBase ©2014 Cloudera, Inc. All rights reserved. Driver Walker Node Configs Executor Static Space Configs HConnection Tasks Tasks Walker Node Executor Static Space Configs HConnection Tasks Tasks
  32. 32. 32 High Level Architecture ©2014 Cloudera, Inc. All rights reserved.
  33. 33. 33 Real-Time Event Processing Approach ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App
  34. 34. 34 NRT Processing ©2014 Cloudera, Inc. All rights reserved.
  35. 35. 35 Focus on NRT First ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App NRT Event Processing with Context
  36. 36. 36 Streaming Architecture – NRT Event Processing ©2014 Cloudera, Inc. All rights reserved. Flume Source Flume Source Kafka Initial Events Topic Flume Source Flume Interceptor Event Processing Logic Local Memory HBase Client Kafka Answer Topic HBase KafkaConsumer KafkaProducer Able to respond with in 10s of milliseconds
  37. 37. 37 Partitioned NRT Event Processing ©2014 Cloudera, Inc. All rights reserved. Flume Source Flume Source Kafka Initial Events Topic Flume Source Flume Interceptor Event Processing Logic Local Memory HBase Client Kafka Answer Topic HBase KafkaConsumer KafkaProducer Topic Partition A Partition B Partition C Producer Partitione r Producer Partitione r Producer Partitione r Custom Partitioner Better use of local memory
  38. 38. 38 Completing the Puzzle ©2014 Cloudera, Inc. All rights reserved.
  39. 39. 39 Micro Batching ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App Micro Batching Micro Batching Micro Batching
  40. 40. 40 Complex Topologies ©2014 Cloudera, Inc. All rights reserved. Kafka Initial Events Topic Spark Streaming KafkaDirect Connection Dag Topologies Kafka Initial Events Topic Spark Streaming Kafka Receivers Dag Topologies Kafka Receivers Kafka Receivers • Manages Offset • Stores Offset is RDD • No longer needs HDFS for initial RDD check pointing • Lets Kafka Manage Offsets • Uses HDFS for initial RDD recovery 1.3 1.2
  41. 41. 41 MicroBatch Bad-Input Handling ©2014 Cloudera, Inc. All rights reserved. 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – incoming events topic Dag Topologies 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – bad events topic 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – resolved events topic 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – results topic
  42. 42. 42 Ingestion ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App Ingestion Ingestion
  43. 43. 43 Ingestion ©2014 Cloudera, Inc. All rights reserved. Flume HDFS Sink Kafka Cluster Topic Partition A Partition B Partition C Sink Sink Sink HDFS Flume SolR Sink Sink Sink Sink SolR Flume Hbase Sink Sink Sink Sink HBase
  44. 44. 44 Reflective Thoughts ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App Research and Searching
  45. 45. ©2014 Cloudera, Inc. All rights reserved.

×