Fraud Detection with Hadoop

1© Cloudera, Inc. All rights reserved.
Hadoop Application
Architectures: a Fraud Tutorial
Mark Grover, Cloudera Developer and Author

Questions? tiny.cloudera.com/app-arch-questions
About the book
§@hadooparchbook
§hadooparchitecturebook.com
§github.com/hadooparchitecturebook
§slideshare.com/hadooparchbook

About the presenter
§Software Engineer on
Spark at Cloudera
§Committer on Apache
Bigtop, PMC member on
Apache Sentry(incubating)
§Contributor to Apache
Spark, Hadoop, Hive,
Sqoop, Pig, Flume
Mark Grover

Case Study Overview
Fraud Detection

Credit Card Transaction Fraud

Health Insurance Fraud

Network anomaly detection

How Do We React
§ Human Brain at Tennis
- Muscle Memory
- Fast Thought
- Slow Thought

§ Muscle memory – for quick action (e.g. reject events)
- Real time alerts
- Real time dashboard
§ Platform for automated learning and improvement
- Real time (fast thought)
- Batch (slow thought)
§ Slow thought
- Audit trail and analytics for human feedback
Back to network anomaly detection

Use case
Network Anomaly Detection

§ “Beaconing” – Malware “phoning home”.
Use Case – Network Anomaly Detection
myserver.com
badhost.com

§ Distributed Denial of Service attacks.
myserver.com
badhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.com

§ Network intrusion attempts – attempted or successful break-ins to a network.

All of these require the ability to detect deviations from known patterns.

Other Use Cases
Could be used for any system requiring detection of anomalous events:
Credit card transactions Market transactions Internet of Things
Etc…

§ Standard protocol defining a format for capturing network traffic flow through
routers.
§ Captures network data as flow records. These flow records provide information
on source and destination IP addresses, traffic volumes, ports, etc.
§ We can use this data to detect normal and anomalous patterns in network
traffic.
Input Data – NetFlow

Challenges of Hadoop Implementation

Challenges - Architectural Considerations
§ Storing state (for real-time decisions):
- Local Caching? Distributed caching (e.g. Memcached)? Distributed storage (e.g. HBase)?
§ Profile Storage
- HDFS? HBase?
§ Ingestion frameworks (for events):
- Flume? Kafka? Sqoop?
§ Processing engines (for background processing):
- Storm? Spark? Trident? Flume interceptors? Kafka Streams?
§ Data Modeling
§ Orchestration
- Do we still need it for real-time systems?

Case Study
Requirements
Overview

Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.

What do we mean by streaming?
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch

But, there’s no free lunch
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
“Difficult” architectures, lower
latency
“Easier” architectures, higher
latency

Events
Buffer
Anomaly
detector
Alerts/Events
Profiles/Rules
Long term
storage
Buffer
NRT
engine
Batch processing
(ML/manual
adjustments)
Reporting info
Analytics interface
Real time
dashboard
Storage
NRT processing
External
systems
Batch
processing
High level architecture

High level architecture
Events
Buffer
Anomaly
detector
Alerts/Events
Profiles/Rules
Long term
storage
Buffer
NRT
engine
Batch processing
(ML/manual
adjustments)
Reporting info
Analytics interface
Real time
dashboard
Storage
NRT processing
External
systems
Batch
processing

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Full Architecture

Storage Layer Considerations
§Two likely choices for long-term storage of data:

Data Storage Layer Choices
§Stores data directly as files
§Fast scans
§Poor random reads/writes
§Stores data as Hfiles on HDFS
§Slow scans
§Fast random reads/writes

Storage Considerations
§ Batch processing and reporting requires access to all raw and processed data.
§ Stream processing and alerting requires quickly fetching and saving profiles.

Data Storage Layer Choices
§For ingesting raw data.
§Batch processing, also
some stream processing.
§For reading/writing profiles.
§Fast random reads and writes
facilitates quick access to
large number of profiles.

New Hadoop Storage Option
Use case break up
Structured Data
SQL + Scan Use Cases
Unstructured
Data
Deep Storage
Scan Use Cases
Fixed Columns
Schemas
SQL + Scan Use
Cases
Any Type of Column
Schemas
Gets / Puts / Micro
Scans

But…
Is HBase fast enough for fetching profiles?
HBase and/or
Memory StoreFetching & Updating Profiles/Rules
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Hadoop Cluster I
HBase and/or
Memory Store

Caching Options
§ Local in-memory cache (on-heap, off-heap).
§ Remote cache
- Distributed cache (Memcached, Oracle Coherence, etc.)
- HBase BlockCache

§ Allows us to keep recently used data blocks in memory.
§ Note that this will require recent versions of HBase and specific configurations
on our cluster.
Caching – HBase BlockCache

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Local Cache (Profiles)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Our Storage Choices

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Reminder: Full Architecture

Processing
Hadoop Cluster II
Storage
Batch Processing
HDFS
Impala
Map/Reduce
Spark
Spark Streaming

Two Kinds of Processing
Streaming Batch
Spark Streaming
Batch Processing
Impala
Map/Reduce
Spark

Stream Processing
Considerations

Streaming use-cases

Streaming Use-cases
▪ Ingestion (most relevant in our use-case)
▪ Simple transformations
- Decision (e.g. anomaly detection)
- Enrichment (e.g. add a state based on zipcode)
▪ Advanced usage
- Machine Learning
- Windowing

#1 - Simple ingestion
Buffer
Event e Stream
Processing Long term
storage
Event e

#2 - Enrichment
Buffer
Event e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store

#2 - Decision
Buffer
Event e Stream
Processing Storage
Event e’
e’ = e + decision
Rules

#3 – Advanced usage
Buffer
Event e Stream
Processing Storage
Event e’
e’ = aggregation or
windowed aggregation
Model

#1 – Simple Ingestion
1. Zero transformation
- No transformation, plain ingest
- Keep the original format – SequenceFile, Text, etc.
- Allows to store data that may have errors in the schema
2. Format transformation
- Simply change the format of the field
- To a structured format, say, Avro, for example
- Can do schema validation
3. Atomic transformation
- Mask a credit card number

#2 - Enrichment
Buffer
Event e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store
Need to store the
context
somewhere

Where to store the context?
1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)
- Local to Node (Off Process)
2. Partitioned Cache
- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)

#1a - Locally broadcast cached data
Could be
On heap or Off heap

#1b - Off process cached data
Data is cached on the
node, outside of
process. Potentially in
an external system like
Rocks DB

#2 - Partitioned cache data
Data is partitioned
based on field(s) and
then cached

#3 - External fetch
Data fetched from
external system

Partitioned cache + external

Streaming semantics

Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve

Semantics of our architecture
Source System 1
Destination
systemSource System 2
Source System 3
Ingest Extract Streaming
engine
Push
Message broker

Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu

Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu
De-duplication at file level
Semantics at key/record level

Which streaming
engine to choose?

Spark Streaming
▪ Micro batch based architecture
▪ Allows stateful transformations
▪ Feature rich
- Windowing
- Sessionization
- ML
- SQL (Structured Streaming)

Spark Streaming
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count Print
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count Print
First
Batc
h
Second
Batch

DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
Print
Source Receiver
RDD
partitions
RDD
Parition
RDD
Single Pass
Filter Count
Pre-first
Batch
First
Batc
h
Second
Batch
Stateful
RDD 1
Print
Stateful
RDD 2
Stateful
RDD 1
Spark Streaming

Spark Streaming Example
1. val conf = new SparkConf().setMaster("local[2]”)
2. val ssc = new StreamingContext(conf, Seconds(1))
3. val lines = ssc.socketTextStream("localhost", 9999)
4. val words = lines.flatMap(_.split(" "))
5. val pairs = words.map(word => (word, 1))
6. val wordCounts = pairs.reduceByKey(_ + _)
7. wordCounts.print()
8. SSC.start()

Spark Streaming Example
1. val conf = new SparkConf().setMaster("local[2]”)
2. val sc = new SparkContext(conf)
3. val lines = sc.textFile(path, 2)
4. val words = lines.flatMap(_.split(" "))
5. val pairs = words.map(word => (word, 1))
6. val wordCounts = pairs.reduceByKey(_ + _)
7. wordCounts.print()

Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)

Flink - ABS
Operator
Buffer

Operator
Buffer
Operator
Buffer
Flink - ABS
Barrier 1A Hit
Barrier 1B
Still Behind

Operator
Buffer
Flink - ABS
Both Barriers
Hit
Operator
Buffer
Barrier 1A Hit
Barrier 1B
Still Behind

Operator
Buffer
Flink - ABS Both Barriers
Hit
Operator
Buffer Barrier is
combined and
can move on
Buffer can be
flushed out

Storm
▪ Old school
▪ Didn’t manage state – had to use Trident
▪ No good support for batch processing

Samza
▪ Good integration with Kafka
▪ Doesn’t support batch
▪ Forked by Kafka Streams

Flume
▪ Well integrated with the Hadoop ecosystem
▪ Allowed interceptors (for simple transformations)
▪ Supports buffering
- Memory
- File
- Kafka
▪ But no real fault-tolerance
▪ No state management

Others
§ Apache Apex
§ Kafka Streams
§ Heron

Apache Beam
§ Abstraction on top of Streaming Engines
§ Best support for Google Dataflow

Why Spark Streaming?
§ There’s one engine to know.
§ Micro-batching helps you scale reliably.
§ Exactly once counting
§ Hadoop ecosystem integration is baked in.
§ No risk of data loss.
§ It’s easy to debug and run.
§ State management
§ Huge eco-system backing
§ You have access to ML libraries.
§ You can use SQL where needed.
§ Wide-spread use and hardened

Why batch in a NRT use case?
§For background processing
- To be later used for quick decision making
§Exploratory analysis
- Machine Learning
§Automated rule changes
- Based on new patterns
§Analytics
- Who’s attacking us?

Spark
§ The New Kid that isn’t that New Anymore
§ Easily 10x less code
§ Extremely Easy and Powerful API
§ Very good for iterative processing (machine learning, graph processing, etc.)
§ Scala, Java, and Python support
§ RDDs
§ Has Graph, ML and SQL engines built on it
§ R integration (SparkR)

Impala
§ MPP Style SQL Engine on top of Hadoop
§ Very Fast
§ High Concurrency
§ Analytical windowing functions

Batch processing in
our use-case

Batch processing in fraud-detection
§ Reporting
- How many events come daily?
- How does it compare to last year?
§ Machine Learning
- Automated rules changes
§ Other processing
- Tracking device history and updating profiles

Reporting
§ Need a fast JDBC-compliant SQL framework
§ Impala or Hive or Presto?
§ We choose Impala
- Fast
- Highly concurrent
- JDBC/ODBC support

Machine Learning
§ Options
- MLLib (with Spark)
- Oryx
- H2O
- Mahout
§ We recommend MLLib because of larger use of Spark in our architecture.

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Storage

HDFS
Raw and Processed
Events
HBase
Profiles
(Disk/Cache)
Local Cache
Profiles

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
NRT Processing/Alerting

Flume Source
Flume Source
Kafka
Transactions Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase Client
Kafka
Replies Topic
KafkaConsumer
KafkaProducer
HBase
Kafka
Updates Topic
Kafka
Rules Topic

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Ingestion

Flume HDFS Sink
Kafka Cluster
Topic
Partition A
Partition B
Partition C
Sink
Sink
Sink
HDFS
Flume Solr Sink
Sink
Sink
Sink
Solr
Flume HBase Sink
Sink
Sink
Sink
HBase

Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Pattern detection
Batch Time
Adjustments
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Processing

Hadoop Cluster II
Storage
Batch Processing
HDFS
Impala
Map/Reduce
Spark
Spark Streaming
Reporting
Stream
Processing
Batch, ML,
etc.

Ingest and Near Real-
Time Alerting
Considerations

The basic workflow
Network
Device
Network
Events
Queue
Event
Handler
Data Store
Fetch & Update
Profiles
Here is an event. Is it suspicious?
Queue
1 2
3

The Queue
§ What makes Apache Kafka a good choice?
- Low latency
- High throughput
- Partitioned and replicated
- Easy to plug to applications
1

The Basics
§Messages are organized into topics
§Producers push messages
§Consumers pull messages
§Kafka runs in a cluster. Nodes are called brokers

Topics, Partitions and Logs

Consumers
Consumer Group Y
Consumer Group X
Consumer
Kafka Cluster
Topic
Partition A (File)
Partition B (File)
Partition C (File)
Consumer
Consumer
Consumer
Order retained within
partition
Order retained with in
partition but not over
partitions
OffSetX
OffSetX
OffSetX
OffSetYOffSetYOffSetY
Offsets are kept per
consumer group

In our use-case
§ Topic for profile updates
§ Topic for current events
§ Topic for alerts
§ Topic for rule updates
Partitioned by device ID

Easier than you think
§ Kafka provides load balancing and availability
- Add processes when needed – they will get their share of partitions
- You control partition assignment strategy
- If a process crashes, partitions will get automatically reassigned
- You control when an event is “done”
- Send events safely and asynchronously
§ So you focus on the use-case, not the framework

Choosing Processor Deployment
Flume
•Part of Hadoop
•Easy to configure
•Not so easy to scale
Yarn
•Part of Hadoop
•Implement processor
inside Yarn app
•Add resources as
needed - Scalable
•Not easy.
Chef / Puppet /
Ansible
•Part of Devops
•Somewhat easy to
scale
•Can be dockerized
Our choice for
this example
Solid choiceAdvanced
users

Inside The Processor
Flume Source
Flume Source
Kafka
Events Topic
Flume Source
Flume Interceptor
Local
Memory
HBase Client
Kafka
Alerts Topic
KafkaConsumer
KafkaProducer
HBase
Kafka
Profile/Rules Updates Topic
Kafka
Rules Topic

Scaling the Processor
Flume Source
Flume Source
Kafka
Initial Events Topic
Flume Source
Flume Interceptor
Local
Memory
HBase
Client
Kafka
Alerts Topic
HBaseKafkaConsumer
KafkaProducer
Events Topic
Partition A
Partition B
Partition C
Producer
Partitioner
Producer
Partitioner
Producer
Partitioner
Better use of local
memory
Kafka
Profile/Rules Topic
Profile/Rules Topic
Partition A

Ingest – Gateway to deep analytics
§ Data needs to end up in:
- SparkStreaming
- Can read directly from Kafka
- HBase
- HDFS
3

Considerations for Ingest
§ Async – nothing can slow down the immediate reaction
- Always ingest the data from a queue in separate thread or process
to avoid write-latency
§ Useful end-points – integrates with variety of data sources and sinks
§ Supports standard formats and possibly some transformations
- Data format inside the queue can be different than in the data store
For example Avro -> Parquet
§ Cross data center

More considerations - Safety
§ Reliable – data loss is not acceptable
- Either “at-least-once” or “exactly-once”
- When is “exactly-once” possible?
§ Error handling – reasonable reaction to unreasonable data
§ Security – Authentication, authorization, audit, lineage

Anti-Pattern
Using complex event processing framework for ingest
Don’t do that. Ingest is different.

Good Patterns
§ Ingest all things. Events, evidence, alerts
- It has potential value
- Especially when troubleshooting
- And for data scientists
§ Make sure you take your schema with you
- Yes, your data probably has a schema
- Everyone accessing the data (web apps, real time, stream, batch) will need to
know about it
- So make it accessible in a centralized schema registry

Choosing Ingest
Flume
•Part of Hadoop
•Key-value based
•Supports HDFS and HBase
•Serializers provide conversions
•At-least once delivery guarantee
•Proven in huge production
environments
Kafka Connect
•Part of Apache Kafka
•Schema preserving
•Supports HDFS + Hive
•Well integrated support for Parquet,
Avro and JSON
•Exactly-once delivery guarantee
•Easy to manage and scale
•Provides back-pressure
Our choice for
this example

Ingestion - Flume
Flume HDFS Sink
Kafka Cluster
Topic
Partition A
Partition B
Partition C
Sink
Sink
Sink
HDFS
Flume SolR Sink
Sink
Sink
Sink
SolR
Flume HBase Sink
Sink
Sink
Sink
HBase

Fraud Detection with Hadoop

More Related Content

What's hot

Viewers also liked

Similar to Fraud Detection with Hadoop

More from markgrover

Recently uploaded

Fraud Detection with Hadoop