Real-Time Anomaly Detection
Patterns and reference architectures
Gwen Shapira, System Architect
©2014 Cloudera, Inc. All rights reserved.
Overview
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch
Gwen Shapira
• 15 years of moving data
• Formerly consultant, engineer
• System Architect @ Confluent
• Kafka Committer
• @gwenshap
There’s a Book on That
Founded by creators of Kafka - @jaykreps, @nehanarkhede, @junrao
We help you gather, transport, organize, and analyze all of your stream data
What we offer
• Confluent Platform
• Kafka plus critical bug fixes not yet applied in Apache release
• Kafka ecosystem projects
• Enterprise support
• Training and Professional Services
The Problem
Credit Card Transaction Fraud
Coupon Fraud
Video Game Strategy
Health Insurance Fraud
How do we React
• Human Brain at Tennis
• Muscle Memory
• Reaction Thought
• Reflective Meditation
Overview of Key Technologies
Kafka
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
Topics, Partitions and Logs
Each partition is a log
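A partition behaves like an append-only log: producers append records at the end, each record gets a monotonically increasing offset, and consumers read sequentially from any offset. A minimal sketch of that idea (plain Python, not the Kafka API):

```python
class Partition:
    """An append-only log: each record is identified by its offset."""
    def __init__(self):
        self.log = []

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1  # offset of the new record

    def read(self, offset):
        """Return all records from `offset` onward, in append order."""
        return self.log[offset:]

p = Partition()
p.append("swipe:card-1")
p.append("swipe:card-2")
print(p.read(0))  # ['swipe:card-1', 'swipe:card-2']
print(p.read(1))  # ['swipe:card-2']
```

Records are immutable once appended; a consumer's position is just an offset it can rewind at will.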
Each Broker has many partitions
[Diagram: three brokers, each holding copies of partitions 0, 1, and 2 of a topic; partitions are replicated across brokers.]
Producers load balance between partitions
[Diagram: a client producer distributes messages across partitions 0, 1, and 2, whose replicas are spread over the brokers.]
Consumers
[Diagram: a Kafka cluster with one topic of three partitions (A, B, C, each a file); Consumer Group X and Consumer Group Y each consume the topic with their own consumers.]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
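Because the broker tracks only one offset per consumer group and partition, two groups can read the same partition independently of each other. A toy illustration (plain Python, not the Kafka API):

```python
log = ["e0", "e1", "e2", "e3"]           # one partition's log
offsets = {"group-x": 0, "group-y": 0}   # committed offset per consumer group

def poll(group, max_records=2):
    """Read the next records for a group and commit the new offset."""
    start = offsets[group]
    records = log[start:start + max_records]
    offsets[group] = start + len(records)
    return records

print(poll("group-x"))  # ['e0', 'e1']
print(poll("group-x"))  # ['e2', 'e3']
print(poll("group-y"))  # ['e0', 'e1'] -- group-y is unaffected by group-x
```

This is why adding consumers is cheap: the broker stores no per-message delivery state, only one integer per group and partition.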
Consumer-Producer Pattern
Keeping Things Simple
• Consume records from Kafka Topic
• Filter, transform, join, lookups, aggregate
• Write to another Kafka Topic
• https://github.com/confluentinc/examples/tree/master/specific-avro-consumer
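The pattern is just a loop: poll records from one topic, filter and transform them, and produce the results to another topic. Sketched here against hypothetical in-memory topics (the linked example does the same with the real Java clients and Avro):

```python
input_topic = ["3", "not-a-number", "42"]  # stand-in for the source Kafka topic
output_topic = []                          # stand-in for the destination topic

def process(records):
    """Consume, filter out bad records, transform the rest, produce."""
    for r in records:
        if r.isdigit():                      # filter
            output_topic.append(int(r) * 2)  # transform + produce

process(input_topic)
print(output_topic)  # [6, 84]
```

The real version replaces the two lists with a KafkaConsumer poll loop and a KafkaProducer send, but the shape of the logic is the same.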
Kafka Makes Streams Easy
• Producers partition the data
• Consumers load balance partitions
• Add / remove consumers any way you want
• Will work with any framework (or none!)
Coming Soon to Kafka Near You
• Kafka Connect - Export / Import for Kafka - 0.9.0 (It's here!)
• KStream
• Consumer-Producer client - Processor (0.10.0 - April?)
• DSLs:
• KStream (a bit like Spark) - (0.10.0 - April?)
• SQL - ???
Kafka Connect - It's a thing
• Easy to add connectors to Kafka
• Existing connectors
• JDBC
• HDFS
• MySQL * 2
• ElasticSearch * 4
• Cassandra
• S3 * 2
• MQTT
• Twitter
• Kafka Connectors:
• http://www.confluent.io/developers/connectors
• http://docs.confluent.io/2.0.0/connect/index.html
• KStreams:
• https://github.com/gwenshap/kafka-examples/blob/master/KafkaStreamsAvg
Spark Streaming
Spark Example
val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)
Spark Streaming Example
val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
Spark Streaming
[Diagram: micro-batching. In each batch interval (pre-first, first, second batch) a receiver turns source data into an RDD, and each RDD goes through a single pass of Filter → Count → Print. A DStream is the sequence of these RDDs across batches.]
[Diagram: the stateful variant. The same Filter → Count pass runs per batch, but each batch's count updates a stateful RDD (Stateful RDD 1, Stateful RDD 2) that is carried from one batch to the next before Print.]
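The diagrams above boil down to this: the stream is chopped into small batches, the same pass (filter → count) runs over each batch, and the stateful variant carries a running total across batches. A plain-Python simulation of that micro-batch idea (not the Spark API):

```python
# One list per batch interval, as if a receiver had collected them.
batches = [["a", "bb", "ccc"], ["dd", "e"], []]

state = 0  # running count carried across batches (the "stateful RDD")
for i, batch in enumerate(batches):
    passed = [x for x in batch if len(x) > 1]  # filter step
    count = len(passed)                        # count step
    state += count                             # state update step
    print(f"batch {i}: count={count}, running total={state}")
# batch 0: count=2, running total=2
# batch 1: count=1, running total=3
# batch 2: count=0, running total=3
```

In real Spark Streaming the per-batch work is distributed over RDD partitions and the state is itself an RDD, but the batch-by-batch control flow is the same.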
High Level Architecture
Real-Time Event Processing Approach
[Diagram: clients ("Swipe here!") send events through Kafka and Flume agents. Spark Streaming fetches and updates profiles in HBase / memory through a local cache and adjusts NRT statistics. An HDFSEventSink lands events in HDFS on Hadoop Cluster I (storage and processing: Hive/Impala, MapReduce, Spark) for batch-time adjustments and pattern detection, both automated and manual. A SolR sink feeds SolR on Hadoop Cluster II (search), where a web app supports review of NRT changes and counters; analytical adjustments flow back into the NRT path.]
[Diagram: the same approach rebuilt around Kafka. Clients ("Swipe here!") produce to Kafka; KStreams jobs running on an analytics layer (Yarn / Mesos) fetch and update profiles through a local cache and adjust NRT stats. Kafka Connect connectors move data to HDFS, NoSQL, and a DWH for batch-time adjustments; SolR and a web app support review of NRT changes and counters; analytical adjustments and pattern detection feed back.]
[Diagram: KStream processors consume the Profile Updates, Model Updates, and Transactions topics, maintain a Local Store (backed by a RedoLog topic), and produce Decisions; data also flows on to the DWH.]
NRT Processing
Focus on NRT First
[Diagram: the high-level architecture again, with the NRT path highlighted: clients → Kafka → processor with a local cache, fetching and updating profiles in HBase / memory and adjusting NRT statistics. The batch and search paths (HDFS, Hive/Impala, MapReduce, Spark, SolR sink, HDFSEventSink) stay in the background.]
Streaming Architecture – NRT Event Processing
[Diagram: a KafkaConsumer reads from the initial-events topic in Kafka; the event processing logic consults local memory and an HBase client (backed by HBase); a KafkaProducer writes results to the answer topic. This path is able to respond within tens of milliseconds.]
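The low latency comes from answering most lookups out of local memory and only falling back to HBase on a cache miss. A sketch of that read path (plain Python; `hbase_lookup` is a stand-in for the real HBase client, and the fraud rule is made up for illustration):

```python
local_cache = {}

def hbase_lookup(card_id):
    """Stand-in for the HBase client: slow but authoritative profile store."""
    return {"card": card_id, "avg_amount": 50.0}

def get_profile(card_id):
    if card_id not in local_cache:       # miss: pay the HBase round trip once
        local_cache[card_id] = hbase_lookup(card_id)
    return local_cache[card_id]          # hit: served from local memory

def handle(event):
    """Process one consumed event; the result would go to the answer topic."""
    profile = get_profile(event["card"])
    verdict = "fraud?" if event["amount"] > 10 * profile["avg_amount"] else "ok"
    return {"card": event["card"], "verdict": verdict}

print(handle({"card": "c1", "amount": 600.0}))  # flagged: 12x the average
print(handle({"card": "c1", "amount": 20.0}))   # ok, served from the cache
```

Wrap `handle` in a KafkaConsumer poll loop and send its result with a KafkaProducer and you have the boxes in the diagram.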
Partitioned NRT Event Processing
[Diagram: the same pipeline, with a custom partitioner in each producer routing events to partitions A, B, and C by key. Each processing instance then sees all events for its keys, which makes better use of the local cache.]
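A custom partitioner just maps a record key to a partition number. Routing by, say, card number means every event for a given card lands on the same consumer, so that card's profile stays hot in one instance's local cache. A minimal sketch (plain Python; MD5 is an arbitrary stable hash chosen for illustration, and the Java clients let you plug in a partitioner class for the same purpose):

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    """Deterministically map a key to a partition, stable across runs."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for one card hash to the same partition -> same consumer.
assert partition_for("card-42") == partition_for("card-42")
print(partition_for("card-42"), partition_for("card-7"))
```

The important property is determinism, not the hash function: as long as the mapping is stable, the cache locality in the diagram follows.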
Questions?
http://confluent.io
@confluentInc
@gwenshap
gwen@confluent.io

Fraud Detection for Israel BigThings Meetup


Editor's Notes

  • #4 This gives me a lot of perspective regarding the use of Hadoop
  • #16 Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID, called an offset. The offset uniquely identifies a message within a partition.
  • #17 Kafka retains all messages for a fixed amount of time, not waiting for acks from consumers. The only metadata retained per consumer is the position in the log – the offset – so adding many consumers is cheap. On the other hand, consumers have more responsibility and are more challenging to implement correctly. And "batching" consumers is not a problem.
  • #18 3 partitions, each replicated 3 times.
  • #19 They choose how many replicas must ACK a message before it is considered committed. This is the tradeoff between speed and reliability.
  • #20 They choose how many replicas must ACK a message before it is considered committed. This is the tradeoff between speed and reliability.
  • #21 Consumers can read from one or more partition leaders. You can't have two consumers in the same group reading the same partition. Leaders obviously do more work, but they are balanced between nodes. We reviewed the basic components of the system, and it may seem complex. In the next section we'll see how simple it actually is to get started with Kafka.