This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
Architecting applications with Hadoop - Fraud Detection
1. Hadoop Application Architectures: Fraud Detection
Strata + Hadoop World, New York 2015
tiny.cloudera.com/app-arch-new-york
Gwen Shapira | @gwenshap
Jonathan Seidman | @jseidman
Ted Malaska | @ted_malaska
Mark Grover | @mark_grover
2. Logistics
§ Break at 10:30-11:00 AM
§ Questions at the end of each section
§ Slides at tiny.cloudera.com/app-arch-new-york
§ Code at https://github.com/hadooparchitecturebook/fraud-detection-tutorial
4. About the presenters
Ted Malaska
§ Principal Solutions Architect at Cloudera
§ Previously, lead architect at FINRA
§ Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
Jonathan Seidman
§ Senior Solutions Architect/Partner Enablement at Cloudera
§ Contributor to Apache Sqoop
§ Previously, Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data
5. About the presenters
Gwen Shapira
§ System Architect at Confluent
§ Committer on Apache Kafka, Sqoop
§ Contributor to Apache Flume
Mark Grover
§ Software Engineer at Cloudera
§ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
§ Contributor to Apache Spark, Hadoop, Hive, Sqoop, Pig, Flume
10. How Do We React
§ Human Brain at Tennis
- Muscle Memory
- Fast Thought
- Slow Thought
11. Results:
§ Rejected transactions
§ Real-time alerts
§ Real-time dashboard
§ Platform for automated learning and improvement – real-time and batch
§ Audit trail and analytics for human feedback
23. Some Definitions
§ Real-time – well, not really. Most of what we’re talking about here is…
§ Near real-time:
- Stream processing – continual processing of incoming data.
- Alerting – immediate (well, as fast as possible) responses to incoming events.
[Diagram: a latency spectrum from typical Hadoop processing (minutes to hours) – data sources, extract/transform/load – through stream processing (seconds) down to alerts (milliseconds), getting closer to real-time; fraud detection sits at the low-latency end.]
27. Requirements – Alerts
§ Need to be able to respond to incoming events quickly (milliseconds).
§ Need high throughput and low latency.
§ Processing requirements are minimal.
32. Requirements – Stream Processing
§ A few seconds to minutes response times.
§ Need to be able to detect trends, thresholds, etc. across profiles.
§ Quick processing more important than 100% accuracy.
33. Requirements – Batch Processing
§ Non real-time, “off-line”, exploratory processing and reporting.
§ Data needs to be available for analysts and users to analyze with tools of choice –
SQL, MapReduce, etc.
§ Results need to be able to feed back into NRT processing to adjust rules, etc.
"Parmigiano reggiano factory". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Parmigiano_reggiano_factory.jpg#/media/File:Parmigiano_reggiano_factory.jpg
38. Data Storage Layer Choices
HDFS:
§ Stores data directly as files
§ Fast scans
§ Poor random reads/writes
HBase:
§ Stores data as HFiles on HDFS
§ Slow scans
§ Fast random reads/writes
39. Storage Considerations
§ Batch processing and reporting requires access to all raw and processed data.
§ Stream processing and alerting requires quickly fetching and saving profiles.
40. Data Storage Layer Choices
HDFS:
§ For ingesting raw data
§ Batch processing, also some stream processing
HBase:
§ For reading/writing profiles
§ Fast random reads and writes facilitate quick access to a large number of profiles (see the client sketch below)
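To make the random-access pattern concrete, here is a minimal sketch using the standard HBase 1.x client API. The "profiles" table, the "p" column family, the "failed_logins" qualifier, and the device ID are hypothetical names, not part of the tutorial code.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical layout: row key = device ID, column family "p"
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("profiles"))
val deviceId = "device-42"  // placeholder

// Random read: fetch one device's profile by row key (assumes the cell exists)
val result = table.get(new Get(Bytes.toBytes(deviceId)))
val failedLogins = Bytes.toLong(result.getValue(Bytes.toBytes("p"), Bytes.toBytes("failed_logins")))

// Random write: update a single cell for that device
val put = new Put(Bytes.toBytes(deviceId))
put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("failed_logins"), Bytes.toBytes(failedLogins + 1))
table.put(put)

table.close()
connection.close()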
41. But…
Is HBase fast enough for fetching profiles?
[Diagram: multiple Flume sources, each with a rule interceptor, fetching and updating profiles/rules against Hadoop Cluster I (HBase and/or a memory store).]
44. Caching – HBase BlockCache
§ Allows us to keep recently used data blocks in memory.
§ Note that this will require recent versions of HBase and specific configurations on our cluster (illustrative settings below).
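As a rough illustration only (the property names are from stock HBase; the values are arbitrary examples, not tuning advice), the region server settings involved look like:
hfile.block.cache.size = 0.4 (fraction of region server heap used for the on-heap BlockCache)
hbase.bucketcache.ioengine = offheap (enables the off-heap BucketCache available in recent HBase versions)
hbase.bucketcache.size = 8192 (BucketCache size, in megabytes in recent versions)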
45. Our Storage Choices
[Diagram: incoming events flow from Kafka (Events) into Flume sources with rule interceptors and a local profile cache; alerts/events go back out through Kafka (Alerts/Events); a Flume channel and sink land events in Hadoop Cluster II for storage and batch processing (HDFS, HBase, Impala, MapReduce, Spark), feeding reporting plus automated and manual analytical adjustments and pattern detection; Spark Streaming performs NRT/stream processing and adjusts NRT stats; Hadoop Cluster I (HBase and/or a memory store) serves fetching and updating of profiles/rules, with batch-time adjustments fed back in.]
58. Where does data come from?
§ Device profile database – IP address, access attempts…
§ Device history – previous access attempts, remote IPs…
§ Transaction stream
Our Application
60. The Queue
§ What makes Apache Kafka a good choice?
- Low latency
- High throughput
- Partitioned and replicated
- Easy to plug into applications
61. The Basics
§ Messages are organized into topics
§ Producers push messages
§ Consumers pull messages
§ Kafka runs in a cluster. Nodes are called brokers
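A minimal producer sketch using the Kafka client API (the broker address and topic name are placeholders): producers push records to a topic, brokers persist them, and consumers pull them at their own pace.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// Push a message to the "events" topic; consumers pull it from the brokers later
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("events", "{...event payload...}"))
producer.close()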
67. Consumers
[Diagram: a Kafka cluster with one topic split across Partition A, B, and C (each a file on a broker), read by consumers in Consumer Group X and Consumer Group Y. Order is retained within a partition but not across partitions, and offsets are kept per consumer group.]
68. In our use-case
§ Topic for profile updates
§ Topic for current events
§ Topic for alerts
§ Topic for rule updates
Partitioned by device ID
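Keying each record by device ID is what gives this partitioning: Kafka hashes the key, so all of a device's events land in the same partition and stay ordered relative to each other. A sketch, reusing the producer from the earlier example (the event fields are hypothetical):

// deviceId as the key => same partition for every event from this device
producer.send(new ProducerRecord[String, String]("events", event.deviceId, event.toJson))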
69. The event handler
§ Very basic pattern:
- Consume a record
- “do something to it”
- Produce zero, one, or more results
Kafka Processor Pattern
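A minimal sketch of that consume–process–produce loop using the Kafka consumer (0.9+ API) and producer clients. The isAuthorized check stands in for the real rules, and the broker address and topic names are placeholders.

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "broker1:9092")
consumerProps.put("group.id", "fraud-event-handlers")
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val producerProps = new Properties()
producerProps.put("bootstrap.servers", "broker1:9092")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val consumer = new KafkaConsumer[String, String](consumerProps)
val producer = new KafkaProducer[String, String](producerProps)
consumer.subscribe(Collections.singletonList("events"))

while (true) {
  // Consume a batch of records...
  for (record <- consumer.poll(100).asScala) {
    // ..."do something" to each one (placeholder rule check)...
    if (!isAuthorized(record.key, record.value)) {
      // ...and produce zero, one, or more results
      producer.send(new ProducerRecord[String, String]("alerts", record.key, record.value))
    }
  }
}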
70. Easier than you think
§ Kafka provides load balancing and availability
- Add processes when needed – they will get their share of partitions
- You control partition assignment strategy
- If a process crashes, partitions will get automatically reassigned
- You control when an event is “done”
- Send events safely and asynchronously
§ So you focus on the use-case, not the framework
74. Choosing Processor Deployment
Flume – our choice for this example
• Part of Hadoop
• Easy to configure
• Not so easy to scale
Yarn – advanced users
• Part of Hadoop
• Implement processor inside a Yarn app
• Add resources as needed – scalable
• Not easy
Chef / Puppet / Ansible – solid choice
• Part of DevOps
• Easy to configure
• Somewhat easy to scale
• Can be dockerized
75. Ingest – Gateway to deep analytics
§ Data needs to end up in:
- Spark Streaming
- Can read directly from Kafka (see the sketch below)
- HBase
- HDFS
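For the Spark Streaming path, a sketch of reading directly from Kafka with the Spark 1.x direct stream; the broker address, topic name, and batch interval are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("fraud-events")
val ssc = new StreamingContext(conf, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Direct stream: one Kafka partition maps to one Spark partition, no receiver needed
val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
events.map(_._2).count().print()   // e.g. count events per batch
ssc.start()
ssc.awaitTermination()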
76. Considerations for Ingest
§ Async – nothing can slow down the immediate reaction
- Always ingest the data from the queue in a separate thread or process to avoid adding write latency
§ Useful end-points – integrates with a variety of data sources and sinks
§ Supports standard formats and possibly some transformations
- The data format inside the queue can differ from the format in the data store (for example, Avro in the queue, Parquet in the store)
§ Cross data center
77. More considerations - Safety
§ Reliable – data loss is not acceptable
- Either “at-least-once” or “exactly-once”
- When is “exactly-once” possible?
§ Error handling – reasonable reaction to unreasonable data
§ Security – Authentication, authorization, audit, lineage
79. Good Patterns
§ Ingest all things. Events, evidence, alerts
- It has potential value
- Especially when troubleshooting
- And for data scientists
§ Make sure you take your schema with you
- Yes, your data probably has a schema
- Everyone accessing the data (web apps, real time, stream, batch) will need to
know about it
- So make it accessible in a centralized schema registry
80. Choosing Ingest
Flume – our choice for this example
• Part of Hadoop
• Key-value based
• Supports HDFS and HBase
• Easy to configure
• Serializers provide conversions
• At-least-once delivery guarantee
• Proven in huge production environments
Copycat (now Kafka Connect)
• Part of Apache Kafka
• Schema preserving
• Supports HDFS + Hive
• Well integrated support for Parquet, Avro and JSON
• Exactly-once delivery guarantee
• Easy to manage and scale
• Provides back-pressure
84. Do we really need HBase?
§ What if… Kafka could store all the information used for real-time processing?
[Diagram: a network device sends network events to a queue; an event handler fetches and updates profiles to answer "Here is an event. Is it authorized?"]
85. Log Compaction in Kafka
§ Kafka stores a “change log” of profiles
§ But we want the latest state
§ log.cleanup.policy = {delete / compact}
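A sketch of how the latest state is recovered from that change log, assuming a compacted topic named "profiles" keyed by device ID (both names are placeholders): replaying it from the beginning leaves the most recent value per key, because compaction retains at least the latest record for every key.

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "profile-loader")
props.put("auto.offset.reset", "earliest")   // start from the beginning of the change log
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("profiles"))

// Later records for a key overwrite earlier ones, so the map ends up holding
// the current profile for every device seen in this poll (a real loader would keep polling)
val latestProfiles = mutable.Map[String, String]()
for (record <- consumer.poll(1000).asScala) {
  latestProfiles(record.key) = record.value
}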
86. Do we really need Spark Streaming?
§ What if… we could do low-latency single-event processing and complex stream processing analytics in the same system?
[Diagram: a network device sends network events to a queue; an event handler writes to a data store and to a second queue feeding Spark Streaming, which adjusts rules and statistics. Do we really need two of these?]
87. Is complex processing of single events possible?
§ The challenge:
- Low latency vs high throughput
- No data loss
- Windows, aggregations, joins
- Late data
88. Solutions?
§ Apache Flink
- Process each event, checkpoint in batches
- Local and distributed state, also checkpointed
§ Apache Samza
- State is persisted to Kafka (Change log pattern)
- Local state cached in RocksDB
§ Kafka Streams
- Similar to Samza
- But packaged as a simple library that is part of Apache Kafka
- Late data handled as updates to state
96. Why Spark Streaming?
§ There’s one engine to know.
§ Micro-batching helps you scale reliably.
§ Exactly once counting
§ Hadoop ecosystem integration is baked in.
§ No risk of data loss.
§ It’s easy to debug and run.
§ State management
§ Huge eco-system backing
§ You have access to ML libraries.
§ You can use SQL where needed.
97. Spark Streaming Example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
98. Spark Streaming Example
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)   // path is a placeholder for the input location
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)   // RDDs have no print(); collect and print instead
104. HBase-Spark Module
// Calling bulkDelete on the HBaseContext directly:
val hbaseContext = new HBaseContext(sc, config)
hbaseContext.bulkDelete[Array[Byte]](rdd,
  tableName,
  putRecord => new Delete(putRecord),
  4)

// Or the same operation via the implicit function the module adds to the RDD:
val hbaseContext = new HBaseContext(sc, config)
rdd.hbaseBulkDelete(hbaseContext,
  tableName,
  putRecord => new Delete(putRecord),
  4)
110. New Hadoop Storage Option
Use-case breakup:
§ Unstructured data – deep storage, scan use cases
§ Structured data – SQL + scan use cases
- Fixed column schemas – SQL + scan use cases
- Any type of column schemas – gets / puts / micro scans
112. Real-Time Analytics with Kudu Demo
[Diagram: a generator produces events into Kafka; Spark Streaming consumes them and writes to Kudu; Impala, Spark SQL, and Spark MLlib query and analyze the data in Kudu.]
115. Why batch in an NRT use case?
§ For background processing
- To be later used for quick decision making
§ Exploratory analysis
- Machine Learning
§ Automated rule changes
- Based on new patterns
§ Analytics
- Who’s attacking us?
119. MapReduce
§ Oldie but goody
§ Restrictive Framework / Innovative Workarounds
§ Extreme Batch
120. MapReduce Basic High Level
[Diagram: a mapper reads a block of data from HDFS (replicated) and spills temporary, partitioned, sorted data to the native (local) file system; each reducer copies its partitions – a remote read for all but one node, a local copy otherwise – and writes its output file back to HDFS.]
122. Spark
§ The New Kid that isn’t that New Anymore
§ Easily 10x less code
§ Extremely Easy and Powerful API
§ Very good for iterative processing (machine learning, graph processing, etc.)
§ Scala, Java, and Python support
§ RDDs
125. Spark - DAG
[Diagram: a DAG of textFile, keyBy, filter, join, and take stages over partitioned data. Completed ("good") partitions are kept; when a block is lost, only its lineage is replayed to recompute it, and the downstream ("future") tasks then proceed.]
126. Is Spark replacing Hadoop?
§ Spark is an execution engine
- Much faster than MR
- Easier to program
§ Spark is replacing MR
127. Impala
§ MPP Style SQL Engine on top of Hadoop
§ Very Fast
§ High Concurrency
§ Analytical windowing functions
128. Impala – Broadcast Join
[Diagram: the smaller table is broadcast to every Impala daemon and 100% cached in memory; each daemon streams its local data blocks of the bigger table through a hash join function against the cached smaller table and emits output.]
130. Presto
§ Facebook’s SQL engine replacement for Hive
§ Written in Java
§ Doesn’t build on top of MapReduce
§ Very few commercial vendors supporting it
132. Batch processing in fraud-detection
§ Reporting
- How many events come daily?
- How does it compare to last year?
§ Machine Learning
- Automated rules changes
§ Complex event processing
- Tracking device history and updating profiles
133. Reporting
§ Need a fast JDBC-compliant SQL framework
§ Impala or Hive or Presto?
§ We choose Impala
- Fast
- Highly concurrent
- JDBC/ODBC support
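A hedged sketch of what that JDBC access can look like from Scala, assuming an unsecured cluster reached through the Hive JDBC driver on Impala's default port 21050; the host, table, and column names are placeholders.

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl")
val stmt = conn.createStatement()
// Placeholder report: events per day
val rs = stmt.executeQuery(
  "SELECT to_date(event_time) AS day, COUNT(*) AS events FROM events GROUP BY to_date(event_time)")
while (rs.next()) {
  println(s"${rs.getString("day")}\t${rs.getLong("events")}")
}
conn.close()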
134. Machine Learning
§ Options
- MLLib (with Spark)
- Oryx
- H2O
- Mahout
§ We recommend MLlib because of the broader use of Spark in our architecture.
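As a hedged sketch of that choice (the RDD-based MLlib API of Spark 1.x; the input path, feature encoding, and existing SparkContext sc are assumptions), training a simple fraud/not-fraud classifier might look like this:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical input: one "label,feature1,feature2,..." line per example, built from event history
val training = sc.textFile("hdfs:///fraud/training").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts.head.toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// Score a new transaction's (hypothetical) feature vector; scores like this can feed
// the automated rule adjustments described earlier
val prediction = model.predict(Vectors.dense(0.0, 1.5, 0.3))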
135. Other Batch Processing
§ Options:
- MapReduce
- Spark
§ We choose Spark
- Much faster than MR
- Easier to write applications due to higher level API
- Re-use of code between streaming and batch
136. Do we really need HDFS?
§ What if… Impala, Spark, and Hive could efficiently scan and analyze data in
HBase?