Architecting a Fraud Detection Application with Hadoop
2. 2
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch
Overview
©2014 Cloudera, Inc. All rights reserved.
3. 3
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
Gwen Shapira
4. 4
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
– HDFS, MapReduce, Yarn, HBase, Spark, Avro,
– Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
– And working on a Sentry patch
• Co-author of O’Reilly’s Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
Hello
11. 11
• Typical Atomic Card Fraud Detection
• Ikea Meat Ball
• Multi Coupons Combinations
• OP or Negative Video Games Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School
Review of the Problem
12. 12
How do we React
• Human Brain at Tennis
– Muscle Memory
– Reaction Thought
– Reflective Meditation
15. 15
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
The Basics
18. 18
Each Broker has many partitions
[Diagram: three brokers, each holding a copy of Partition 0, Partition 1, and Partition 2]
19. 19
Producers load balance between partitions
[Diagram: a Client producing to Partition 0, Partition 1, and Partition 2, each of which appears three times across the brokers]
21. 21
Consumers
[Diagram: a Kafka Cluster with one Topic stored as Partition A (File), Partition B (File), and Partition C (File), read by the consumers of Consumer Group X and Consumer Group Y]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group (OffsetX for group X, OffsetY for group Y, one per partition)
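To make the offset bookkeeping concrete, here is a minimal in-memory sketch. It is not the Kafka client API: `ToyTopic`, `push`, and `pull` are hypothetical names invented for illustration, and the only point is that each consumer group advances its own position in each partition.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of one topic; not the Kafka client API.
final class ToyTopic(numPartitions: Int) {
  // Each partition is an append-only sequence; a message's index is its offset.
  private val partitions =
    Vector.fill(numPartitions)(mutable.ArrayBuffer.empty[String])
  // One position per (consumer group, partition), starting at 0.
  private val offsets =
    mutable.Map.empty[(String, Int), Int].withDefaultValue(0)

  // Producers push: append to a partition and return the new message's offset.
  def push(partition: Int, message: String): Int = {
    partitions(partition) += message
    partitions(partition).size - 1
  }

  // Consumers pull: read everything after the group's offset, then advance it.
  def pull(group: String, partition: Int): Seq[String] = {
    val log = partitions(partition)
    val start = offsets((group, partition))
    val batch = log.slice(start, log.size).toList
    offsets((group, partition)) = log.size
    batch
  }
}
```

In this toy model, group X reading a partition does not move group Y's position, so a second group can read the same messages from the beginning.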
23. 23
Short Intro to Flume
Flume Agent
• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: mask, re-format, validate…
• Selectors: DR, critical
• Channels: memory, file, Kafka
• Sinks: HDFS, HBase, Solr
24. 24
Flume and/or Kafka
[Diagram: Flume pipeline with Upstream, Flume Source, Interceptor, Selector, Flume Channel, Flume Sink, Downstream]
• Three of the components are labeled "Can be Kafka"
25. 25
Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
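As a rough illustration of the "mask fields" and "filter" steps, the sketch below uses plain Scala functions rather than the Flume Interceptor interface. `SwipeEvent`, its fields, and the validity rule are assumptions made up for this example.

```scala
// Hypothetical event shape; real Flume events carry headers plus a byte-array body.
final case class SwipeEvent(cardNumber: String, amountCents: Long, merchant: String)

object InterceptorSketch {
  // Mask fields: keep only the last four digits of the card number.
  def mask(e: SwipeEvent): SwipeEvent = {
    val n = e.cardNumber
    val masked = "*" * math.max(0, n.length - 4) + n.takeRight(4)
    e.copy(cardNumber = masked)
  }

  // Filter events: drop records that fail a basic sanity check (assumed rule).
  def isValid(e: SwipeEvent): Boolean =
    e.cardNumber.nonEmpty && e.cardNumber.forall(_.isDigit) && e.amountCents > 0

  // Apply the steps in order: validate first, then mask what survives.
  def intercept(batch: Seq[SwipeEvent]): Seq[SwipeEvent] =
    batch.filter(isValid).map(mask)
}
```

Validation runs before masking here so the digit check sees the original card number.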
27. 27
Spark Streaming Example
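import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Streaming word count over 1-second micro-batches read from a local socket
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
// Nothing runs until the context is started
ssc.start()
ssc.awaitTermination()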
28. 28
Spark Streaming Example
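import org.apache.spark.{SparkConf, SparkContext}

// The same word count as a batch job; `path` is the input location, defined elsewhere
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchWordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// An RDD has no print(); collect the results to the driver and print them there
wordCounts.collect().foreach(println)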
31. 31
Spark Streaming and HBase
[Diagram: a Driver holding Configs, and two Worker Nodes. Each Worker Node runs an Executor whose Static Space holds the Configs and an HConnection, shared by the Tasks on that executor]
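One way to read the "static space" idea is a JVM-level singleton: a Scala `object` is initialized once per JVM, so every task on an executor would reuse the same connection. The sketch below assumes that reading. `ToyConnection` is a hypothetical stand-in for an HBase connection, and nothing here calls the HBase or Spark APIs.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive client such as an HBase connection.
final class ToyConnection(val id: Int) {
  def put(rowKey: String, value: String): String = s"conn-$id wrote $rowKey=$value"
}

// A Scala object is initialized once per JVM, so each executor process
// would build its own single connection the first time a task touches it.
object ConnectionHolder {
  val created = new AtomicInteger(0)
  lazy val connection: ToyConnection = new ToyConnection(created.incrementAndGet())
}

object Tasks {
  // Every task reuses the shared connection instead of opening a new one.
  def runTask(rowKey: String, value: String): String =
    ConnectionHolder.connection.put(rowKey, value)
}
```

The `created` counter exists only to show that repeated tasks do not open additional connections.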
33. 33
Real-Time Event Processing Approach
[Diagram: overall architecture]
• Clusters: Hadoop Cluster I, Hadoop Cluster II
• Entry points: Clients ("Swipe here!"), Web App
• NRT components: Flume Agents, Local Cache, Kafka, HBase / Memory, Spark Streaming
• Storage and processing components: HDFS, Solr, Hive/Impala, MapReduce, Spark, Search
• Sinks: HDFSEventSink, Solr Sink
• Annotations: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Automated & Manual Review of NRT Changes and Counters; Automated & Manual Analytical Adjustments and Pattern Detection
35. 35
Focus on NRT First
[Diagram: the overall architecture (clients, Flume agents, Kafka, HBase, Spark Streaming, HDFS, Solr), with this area highlighted: NRT Event Processing with Context]
36. 36
Streaming Architecture – NRT Event Processing
Flume Source
Flume Source
Kafka
Initial Events Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase
Client
Kafka
Answer Topic
HBase
KafkaConsumer
KafkaProducer
Able to respond within tens of milliseconds
37. 37
Partitioned NRT Event Processing
Flume Source
Flume Source
Kafka
Initial Events Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase
Client
Kafka
Answer Topic
HBase
KafkaConsumer
KafkaProducer
Topic: Partition A, Partition B, Partition C
Three Producers, each with its own Partitioner
Custom Partitioner: better use of local memory
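A minimal sketch of the idea behind a custom partitioner, assuming events are keyed by something like a card ID. If the same key always maps to the same partition, the consumer of that partition can keep the key's profile in local memory. `partitionFor` is a hypothetical helper, not Kafka's Partitioner interface.

```scala
object PartitionerSketch {
  // Map a key to a partition index in [0, numPartitions).
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    // floorMod keeps the result non-negative even when hashCode is negative.
    java.lang.Math.floorMod(key.hashCode, numPartitions)
  }
}
```

A plain hash is only one possible choice. A real deployment might partition on a business key chosen so that related events share local state.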
39. 39
Micro Batching
[Diagram: the overall architecture (clients, Flume agents, Kafka, HBase, Spark Streaming, HDFS, Solr), with three areas highlighted as Micro Batching]
40. 40
Complex Topologies
Spark 1.3: Kafka (Initial Events Topic), Spark Streaming with Kafka Direct Connection, DAG Topologies
• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing
Spark 1.2: Kafka (Initial Events Topic), Spark Streaming with Kafka Receivers, DAG Topologies
• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery
41. 41
MicroBatch Bad-Input Handling
[Diagram: DAG topologies connected to four Kafka topics, each drawn as a log of offsets 0 to 13]
• Kafka – incoming events topic
• Kafka – bad events topic
• Kafka – resolved events topic
• Kafka – results topic
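The routing step can be sketched without Spark or Kafka: split each micro-batch into records that parse and records that do not, so the second group can go to a bad events topic. The record format ("cardId,amountCents"), the `Swipe` type, and the function names are assumptions for illustration only.

```scala
import scala.util.Try

object BadInputSketch {
  final case class Swipe(cardId: String, amountCents: Long)

  // Parse "cardId,amountCents"; Left carries the raw record for the bad events topic.
  def parse(raw: String): Either[String, Swipe] =
    raw.split(",", -1) match {
      case Array(id, amount) if id.nonEmpty =>
        Try(amount.trim.toLong).toOption.map(Swipe(id, _)).toRight(raw)
      case _ => Left(raw)
    }

  // Split one micro-batch into (bad records, parsed events).
  def route(batch: Seq[String]): (Seq[String], Seq[Swipe]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(bad) => bad },
     parsed.collect { case Right(ok) => ok })
  }
}
```

Keeping the raw text of a bad record means it can be repaired later and replayed, which is presumably the role of the resolved events topic.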
42. 42
Ingestion
[Diagram: the overall architecture (clients, Flume agents, Kafka, HBase, Spark Streaming, HDFS, Solr), with two areas highlighted as Ingestion]
43. 43
Ingestion
Kafka Cluster, Topic: Partition A, Partition B, Partition C
• Flume HDFS Sink: three sinks, writing to HDFS
• Flume Solr Sink: three sinks, writing to Solr
• Flume HBase Sink: three sinks, writing to HBase
44. 44
Reflective Thoughts
[Diagram: the overall architecture (clients, Flume agents, Kafka, HBase, Spark Streaming, HDFS, Solr), with this area highlighted: Research and Searching]
Editor's Notes
This gives me a lot of perspective regarding the use of Hadoop
Topics are partitioned, each partition ordered and immutable. Messages in a partition have an ID, called the offset. The offset uniquely identifies a message within a partition.
Kafka retains all messages for a fixed amount of time.
It does not wait for acks from consumers.
The only metadata retained per consumer is the position in the log: the offset.
So adding many consumers is cheap.
On the other hand, consumers have more responsibility and are more challenging to implement correctly.
And “batching” consumers is not a problem.
3 partitions, each replicated 3 times. You choose how many replicas must ACK a message before it’s considered committed.
This is the tradeoff between speed and reliability.
Consumers can read from one or more partition leaders. You can’t have two consumers in the same group reading the same partition.
Leaders obviously do more work, but they are balanced between nodes.
We reviewed the basic components of the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka. It does not require programming.