SlideShare a Scribd company logo
1©	Cloudera,	Inc.	All	rights	reserved.
Hadoop	Application	
Architectures:	a	Fraud	Tutorial
Mark	Grover,	Cloudera	Developer	and	Author
Questions? tiny.cloudera.com/app-arch-questions
About the book
§@hadooparchbook
§hadooparchitecturebook.com
§github.com/hadooparchitecturebook
§slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/app-arch-questions
About the presenter
§Software Engineer on
Spark at Cloudera
§Committer on Apache
Bigtop, PMC member on
Apache Sentry(incubating)
§Contributor to Apache
Spark, Hadoop, Hive,
Sqoop, Pig, Flume
Mark Grover
Case Study Overview
Fraud Detection
Questions? tiny.cloudera.com/app-arch-questions
Credit Card Transaction Fraud
Questions? tiny.cloudera.com/app-arch-questions
Health Insurance Fraud
Questions? tiny.cloudera.com/app-arch-questions
Network anomaly detection
Questions? tiny.cloudera.com/app-arch-questions
How Do We React
§ Human Brain at Tennis
- Muscle Memory
- Fast Thought
- Slow Thought
Questions? tiny.cloudera.com/app-arch-questions
§ Muscle memory – for quick action (e.g. reject events)
- Real time alerts
- Real time dashboard
§ Platform for automated learning and improvement
- Real time (fast thought)
- Batch (slow thought)
§ Slow thought
- Audit trail and analytics for human feedback
Back to network anomaly detection
Use case
Network Anomaly Detection
Questions? tiny.cloudera.com/app-arch-questions
§ “Beaconing” – Malware “phoning home”.
Use Case – Network Anomaly Detection
myserver.com
badhost.com
Questions? tiny.cloudera.com/app-arch-questions
§ Distributed Denial of Service attacks.
Use Case – Network Anomaly Detection
myserver.com
badhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.combadhost.com
Questions? tiny.cloudera.com/app-arch-questions
Use Case – Network Anomaly Detection
§ Network intrusion attempts – attempted or successful break-ins to a network.
Questions? tiny.cloudera.com/app-arch-questions
Use Case – Network Anomaly Detection
All of these require the ability to detect deviations from known patterns.
Questions? tiny.cloudera.com/app-arch-questions
Other Use Cases
Could be used for any system requiring detection of anomalous events:
Credit card transactions Market transactions Internet of Things
Etc…
Questions? tiny.cloudera.com/app-arch-questions
§ Standard protocol defining a format for capturing network traffic flow through
routers.
§ Captures network data as flow records. These flow records provide information
on source and destination IP addresses, traffic volumes, ports, etc.
§ We can use this data to detect normal and anomalous patterns in network
traffic.
Input Data – NetFlow
Questions? tiny.cloudera.com/app-arch-questions
Challenges of Hadoop Implementation
Questions? tiny.cloudera.com/app-arch-questions
Challenges - Architectural Considerations
§ Storing state (for real-time decisions):
- Local Caching? Distributed caching (e.g. Memcached)? Distributed storage (e.g. HBase)?
§ Profile Storage
- HDFS? HBase?
§ Ingestion frameworks (for events):
- Flume? Kafka? Sqoop?
§ Processing engines (for background processing):
- Storm? Spark? Trident? Flume interceptors? Kafka Streams?
§ Data Modeling
§ Orchestration
- Do we still need it for real-time systems?
Case Study
Requirements
Overview
Questions? tiny.cloudera.com/app-arch-questions
Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
Questions? tiny.cloudera.com/app-arch-questions
What do we mean by streaming?
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
Questions? tiny.cloudera.com/app-arch-questions
What do we mean by streaming?
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
Questions? tiny.cloudera.com/app-arch-questions
But, there’s no free lunch
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
“Difficult” architectures, lower
latency
“Easier” architectures, higher
latency
High level architecture
Questions? tiny.cloudera.com/app-arch-questions
Events
Buffer
Anomaly
detector
Alerts/Events
Profiles/Rules
Long term
storage
Buffer
NRT
engine
Batch processing
(ML/manual
adjustments)
Reporting info
Analytics interface
Real time
dashboard
Storage
NRT processing
External
systems
Batch
processing
High level architecture
Questions? tiny.cloudera.com/app-arch-questions
High level architecture
Events
Buffer
Anomaly
detector
Alerts/Events
Profiles/Rules
Long term
storage
Buffer
NRT
engine
Batch processing
(ML/manual
adjustments)
Reporting info
Analytics interface
Real time
dashboard
Storage
NRT processing
External
systems
Batch
processing
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Full Architecture
Storage Layer
Considerations
Questions? tiny.cloudera.com/app-arch-questions
Storage Layer Considerations
§Two likely choices for long-term storage of data:
Questions? tiny.cloudera.com/app-arch-questions
Data Storage Layer Choices
§Stores data directly as files
§Fast scans
§Poor random reads/writes
§Stores data as Hfiles on HDFS
§Slow scans
§Fast random reads/writes
Questions? tiny.cloudera.com/app-arch-questions
Storage Considerations
§ Batch processing and reporting requires access to all raw and processed data.
§ Stream processing and alerting requires quickly fetching and saving profiles.
Questions? tiny.cloudera.com/app-arch-questions
Data Storage Layer Choices
§For ingesting raw data.
§Batch processing, also
some stream processing.
§For reading/writing profiles.
§Fast random reads and writes
facilitates quick access to
large number of profiles.
Questions? tiny.cloudera.com/app-arch-questions
New Hadoop Storage Option
Use	case	break	up
Structured Data
SQL + Scan Use Cases
Unstructured
Data
Deep Storage
Scan Use Cases
Fixed Columns
Schemas
SQL + Scan Use
Cases
Any Type of Column
Schemas
Gets / Puts / Micro
Scans
Questions? tiny.cloudera.com/app-arch-questions
New Hadoop Storage Option
Use	case	break	up
Structured Data
SQL + Scan Use Cases
Unstructured
Data
Deep Storage
Scan Use Cases
Fixed Columns
Schemas
SQL + Scan Use
Cases
Any Type of Column
Schemas
Gets / Puts / Micro
Scans
Questions? tiny.cloudera.com/app-arch-questions
But…
Is HBase fast enough for fetching profiles?
HBase and/or
Memory StoreFetching & Updating Profiles/Rules
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Hadoop Cluster I
HBase and/or
Memory Store
Questions? tiny.cloudera.com/app-arch-questions
Caching Options
§ Local in-memory cache (on-heap, off-heap).
§ Remote cache
- Distributed cache (Memcached, Oracle Coherence, etc.)
- HBase BlockCache
Questions? tiny.cloudera.com/app-arch-questions
§ Allows us to keep recently used data blocks in memory.
§ Note that this will require recent versions of HBase and specific configurations
on our cluster.
Caching – HBase BlockCache
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Local Cache (Profiles)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Our Storage Choices
Processing
Considerations
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Reminder: Full Architecture
Questions? tiny.cloudera.com/app-arch-questions
Processing
Hadoop Cluster II
Storage
Batch Processing
HDFS
Impala
Map/Reduce
Spark
NRT/Stream Processing
Spark Streaming
Questions? tiny.cloudera.com/app-arch-questions
Two Kinds of Processing
Streaming Batch
NRT/Stream Processing
Spark Streaming
Batch Processing
Impala
Map/Reduce
Spark
Stream Processing
Considerations
Questions? tiny.cloudera.com/app-arch-questions
Streaming use-cases
Questions? tiny.cloudera.com/app-arch-questions
Streaming Use-cases
▪ Ingestion (most relevant in our use-case)
▪ Simple transformations
- Decision (e.g. anomaly detection)
- Enrichment (e.g. add a state based on zipcode)
▪ Advanced usage
- Machine Learning
- Windowing
Questions? tiny.cloudera.com/app-arch-questions
#1 - Simple ingestion
Buffer
Event e Stream
Processing Long term
storage
Event e
Questions? tiny.cloudera.com/app-arch-questions
#2 - Enrichment
Buffer
Event e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store
Questions? tiny.cloudera.com/app-arch-questions
#2 - Decision
Buffer
Event e Stream
Processing Storage
Event e’
e’ = e + decision
Rules
Questions? tiny.cloudera.com/app-arch-questions
#3 – Advanced usage
Buffer
Event e Stream
Processing Storage
Event e’
e’ = aggregation or
windowed aggregation
Model
Questions? tiny.cloudera.com/app-arch-questions
#1 – Simple Ingestion
1. Zero transformation
- No transformation, plain ingest
- Keep the original format – SequenceFile, Text, etc.
- Allows to store data that may have errors in the schema
2. Format transformation
- Simply change the format of the field
- To a structured format, say, Avro, for example
- Can do schema validation
3. Atomic transformation
- Mask a credit card number
Questions? tiny.cloudera.com/app-arch-questions
#2 - Enrichment
Buffer
Event e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store
Need to store the
context
somewhere
Questions? tiny.cloudera.com/app-arch-questions
Where to store the context?
1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)
- Local to Node (Off Process)
2. Partitioned Cache
- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
Questions? tiny.cloudera.com/app-arch-questions
#1a - Locally broadcast cached data
Could be
On heap or Off heap
Questions? tiny.cloudera.com/app-arch-questions
#1b - Off process cached data
Data is cached on the
node, outside of
process. Potentially in
an external system like
Rocks DB
Questions? tiny.cloudera.com/app-arch-questions
#2 - Partitioned cache data
Data is partitioned
based on field(s) and
then cached
Questions? tiny.cloudera.com/app-arch-questions
#3 - External fetch
Data fetched from
external system
Questions? tiny.cloudera.com/app-arch-questions
Partitioned cache + external
Questions? tiny.cloudera.com/app-arch-questions
Streaming semantics
Questions? tiny.cloudera.com/app-arch-questions
Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve
Questions? tiny.cloudera.com/app-arch-questions
Semantics of our architecture
Source System 1
Destination
systemSource System 2
Source System 3
Ingest Extract Streaming
engine
Push
Message broker
Questions? tiny.cloudera.com/app-arch-questions
Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu
Questions? tiny.cloudera.com/app-arch-questions
Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu
De-duplication at file level
Semantics at key/record level
Questions? tiny.cloudera.com/app-arch-questions
Which streaming
engine to choose?
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming
▪ Micro batch based architecture
▪ Allows stateful transformations
▪ Feature rich
- Windowing
- Sessionization
- ML
- SQL (Structured Streaming)
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count Print
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count Print
First
Batc
h
Second
Batch
Questions? tiny.cloudera.com/app-arch-questions
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
Print
Source Receiver
RDD
partitions
RDD
Parition
RDD
Single Pass
Filter Count
Pre-first
Batch
First
Batc
h
Second
Batch
Stateful
RDD 1
Print
Stateful
RDD 2
Stateful
RDD 1
Spark Streaming
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming Example
1. val conf = new SparkConf().setMaster("local[2]”)
2. val ssc = new StreamingContext(conf, Seconds(1))
3. val lines = ssc.socketTextStream("localhost", 9999)
4. val words = lines.flatMap(_.split(" "))
5. val pairs = words.map(word => (word, 1))
6. val wordCounts = pairs.reduceByKey(_ + _)
7. wordCounts.print()
8. SSC.start()
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming Example
1. val conf = new SparkConf().setMaster("local[2]”)
2. val sc = new SparkContext(conf)
3. val lines = sc.textFile(path, 2)
4. val words = lines.flatMap(_.split(" "))
5. val pairs = words.map(word => (word, 1))
6. val wordCounts = pairs.reduceByKey(_ + _)
7. wordCounts.print()
Questions? tiny.cloudera.com/app-arch-questions
Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)
Questions? tiny.cloudera.com/app-arch-questions
Flink - ABS
Operator
Buffer
Questions? tiny.cloudera.com/app-arch-questions
Operator
Buffer
Operator
Buffer
Flink - ABS
Barrier 1A Hit
Barrier 1B
Still Behind
Questions? tiny.cloudera.com/app-arch-questions
Operator
Buffer
Flink - ABS
Both Barriers
Hit
Operator
Buffer
Barrier 1A Hit
Barrier 1B
Still Behind
Questions? tiny.cloudera.com/app-arch-questions
Operator
Buffer
Flink - ABS Both Barriers
Hit
Operator
Buffer Barrier is
combined and
can move on
Buffer can be
flushed out
Questions? tiny.cloudera.com/app-arch-questions
Storm
▪ Old school
▪ Didn’t manage state – had to use Trident
▪ No good support for batch processing
Questions? tiny.cloudera.com/app-arch-questions
Samza
▪ Good integration with Kafka
▪ Doesn’t support batch
▪ Forked by Kafka Streams
Questions? tiny.cloudera.com/app-arch-questions
Flume
▪ Well integrated with the Hadoop ecosystem
▪ Allowed interceptors (for simple transformations)
▪ Supports buffering
- Memory
- File
- Kafka
▪ But no real fault-tolerance
▪ No state management
Questions? tiny.cloudera.com/app-arch-questions
Others
§ Apache Apex
§ Kafka Streams
§ Heron
Questions? tiny.cloudera.com/app-arch-questions
Apache Beam
§ Abstraction on top of Streaming Engines
§ Best support for Google Dataflow
Questions? tiny.cloudera.com/app-arch-questions
Why Spark Streaming?
§ There’s one engine to know.
§ Micro-batching helps you scale reliably.
§ Exactly once counting
§ Hadoop ecosystem integration is baked in.
§ No risk of data loss.
§ It’s easy to debug and run.
§ State management
§ Huge eco-system backing
§ You have access to ML libraries.
§ You can use SQL where needed.
§ Wide-spread use and hardened
Processing
Batch
Questions? tiny.cloudera.com/app-arch-questions
Why batch in a NRT use case?
§For background processing
- To be later used for quick decision making
§Exploratory analysis
- Machine Learning
§Automated rule changes
- Based on new patterns
§Analytics
- Who’s attacking us?
Questions? tiny.cloudera.com/app-arch-questions
Spark
§ The New Kid that isn’t that New Anymore
§ Easily 10x less code
§ Extremely Easy and Powerful API
§ Very good for iterative processing (machine learning, graph processing, etc.)
§ Scala, Java, and Python support
§ RDDs
§ Has Graph, ML and SQL engines built on it
§ R integration (SparkR)
Questions? tiny.cloudera.com/app-arch-questions
Impala
§ MPP Style SQL Engine on top of Hadoop
§ Very Fast
§ High Concurrency
§ Analytical windowing functions
Batch processing in
our use-case
Questions? tiny.cloudera.com/app-arch-questions
Batch processing in fraud-detection
§ Reporting
- How many events come daily?
- How does it compare to last year?
§ Machine Learning
- Automated rules changes
§ Other processing
- Tracking device history and updating profiles
Questions? tiny.cloudera.com/app-arch-questions
Reporting
§ Need a fast JDBC-compliant SQL framework
§ Impala or Hive or Presto?
§ We choose Impala
- Fast
- Highly concurrent
- JDBC/ODBC support
Questions? tiny.cloudera.com/app-arch-questions
Machine Learning
§ Options
- MLLib (with Spark)
- Oryx
- H2O
- Mahout
§ We recommend MLLib because of larger use of Spark in our architecture.
Overall Architecture
Overview
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Storage
Questions? tiny.cloudera.com/app-arch-questions
HDFS
Raw and Processed
Events
HBase
Profiles
(Disk/Cache)
Local Cache
Profiles
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
NRT Processing/Alerting
Questions? tiny.cloudera.com/app-arch-questions
Flume Source
Flume Source
Kafka
Transactions Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase Client
Kafka
Replies Topic
KafkaConsumer
KafkaProducer
HBase
Kafka
Updates Topic
Kafka
Rules Topic
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Ingestion
Questions? tiny.cloudera.com/app-arch-questions
Flume HDFS Sink
Kafka Cluster
Topic
Partition A
Partition B
Partition C
Sink
Sink
Sink
HDFS
Flume Solr Sink
Sink
Sink
Sink
Solr
Flume HBase Sink
Sink
Sink
Sink
HBase
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Processing
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
HDFS
Impala
Map/Reduce
Spark
NRT/Stream Processing
Spark Streaming
Reporting
Stream
Processing
Batch, ML,
etc.
Thank you!
Ingest and Near Real-
Time Alerting
Considerations
Questions? tiny.cloudera.com/app-arch-questions
Hadoop Cluster II
Storage
Batch Processing
Hadoop Cluster I
Flume
(Sink)
HBase and/or
Memory Store
HDFS
HBase
Impala
Map/Reduce
Spark
Automated & Manual
Analytical Adjustments and
Pattern detection
Fetching & Updating Profiles/Rules
Batch Time
Adjustments
NRT/Stream Processing
Spark Streaming
Adjusting
NRT stats
Kafka
Events
Reporting
Flume
(Source)
Interceptor(Rules)
Flume
(Source)
Flume
(Source)
Interceptor (Rules)
Kafka
Alerts/Events
Flume Channel
Events
Alerts
Hadoop Cluster I
HBase and/or
Memory Store
Questions? tiny.cloudera.com/app-arch-questions
The basic workflow
Network
Device
Network
Events
Queue
Event
Handler
Data Store
Fetch & Update
Profiles
Here is an event. Is it suspicious?
Queue
1 2
3
Questions? tiny.cloudera.com/app-arch-questions
The Queue
§ What makes Apache Kafka a good choice?
- Low latency
- High throughput
- Partitioned and replicated
- Easy to plug to applications
1
Questions? tiny.cloudera.com/app-arch-questions
The Basics
§Messages are organized into topics
§Producers push messages
§Consumers pull messages
§Kafka runs in a cluster. Nodes are called brokers
Questions? tiny.cloudera.com/app-arch-questions
Topics, Partitions and Logs
Questions? tiny.cloudera.com/app-arch-questions
Consumers
Consumer Group Y
Consumer Group X
Consumer
Kafka Cluster
Topic
Partition A (File)
Partition B (File)
Partition C (File)
Consumer
Consumer
Consumer
Order retained within
partition
Order retained with in
partition but not over
partitions
OffSetX
OffSetX
OffSetX
OffSetYOffSetYOffSetY
Offsets are kept per
consumer group
Questions? tiny.cloudera.com/app-arch-questions
In our use-case
§ Topic for profile updates
§ Topic for current events
§ Topic for alerts
§ Topic for rule updates
Partitioned by device ID
Questions? tiny.cloudera.com/app-arch-questions
Easier than you think
§ Kafka provides load balancing and availability
- Add processes when needed – they will get their share of partitions
- You control partition assignment strategy
- If a process crashes, partitions will get automatically reassigned
- You control when an event is “done”
- Send events safely and asynchronously
§ So you focus on the use-case, not the framework
Questions? tiny.cloudera.com/app-arch-questions
Choosing Processor Deployment
Flume
•Part of Hadoop
•Easy to configure
•Not so easy to scale
Yarn
•Part of Hadoop
•Implement processor
inside Yarn app
•Add resources as
needed - Scalable
•Not easy.
Chef / Puppet /
Ansible
•Part of Devops
•Easy to configure
•Somewhat easy to
scale
•Can be dockerized
Our choice for
this example
Solid choiceAdvanced
users
Questions? tiny.cloudera.com/app-arch-questions
Inside The Processor
Flume Source
Flume Source
Kafka
Events Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase Client
Kafka
Alerts Topic
KafkaConsumer
KafkaProducer
HBase
Kafka
Profile/Rules Updates Topic
Kafka
Rules Topic
Questions? tiny.cloudera.com/app-arch-questions
Scaling the Processor
Flume Source
Flume Source
Kafka
Initial Events Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase
Client
Kafka
Alerts Topic
HBaseKafkaConsumer
KafkaProducer
Events Topic
Partition A
Partition B
Partition C
Producer
Partitioner
Producer
Partitioner
Producer
Partitioner
Better use of local
memory
Kafka
Profile/Rules Topic
Profile/Rules Topic
Partition A
Questions? tiny.cloudera.com/app-arch-questions
Ingest – Gateway to deep analytics
§ Data needs to end up in:
- SparkStreaming
- Can read directly from Kafka
- HBase
- HDFS
3
Questions? tiny.cloudera.com/app-arch-questions
Considerations for Ingest
§ Async – nothing can slow down the immediate reaction
- Always ingest the data from a queue in separate thread or process
to avoid write-latency
§ Useful end-points – integrates with variety of data sources and sinks
§ Supports standard formats and possibly some transformations
- Data format inside the queue can be different than in the data store
For example Avro -> Parquet
§ Cross data center
Questions? tiny.cloudera.com/app-arch-questions
More considerations - Safety
§ Reliable – data loss is not acceptable
- Either “at-least-once” or “exactly-once”
- When is “exactly-once” possible?
§ Error handling – reasonable reaction to unreasonable data
§ Security – Authentication, authorization, audit, lineage
Questions? tiny.cloudera.com/app-arch-questions
Anti-Pattern
Using complex event processing framework for ingest
Don’t do that. Ingest is different.
Questions? tiny.cloudera.com/app-arch-questions
Good Patterns
§ Ingest all things. Events, evidence, alerts
- It has potential value
- Especially when troubleshooting
- And for data scientists
§ Make sure you take your schema with you
- Yes, your data probably has a schema
- Everyone accessing the data (web apps, real time, stream, batch) will need to
know about it
- So make it accessible in a centralized schema registry
Questions? tiny.cloudera.com/app-arch-questions
Choosing Ingest
Flume
•Part of Hadoop
•Key-value based
•Supports HDFS and HBase
•Easy to configure
•Serializers provide conversions
•At-least once delivery guarantee
•Proven in huge production
environments
Kafka Connect
•Part of Apache Kafka
•Schema preserving
•Supports HDFS + Hive
•Well integrated support for Parquet,
Avro and JSON
•Exactly-once delivery guarantee
•Easy to manage and scale
•Provides back-pressure
Our choice for
this example
Questions? tiny.cloudera.com/app-arch-questions
Ingestion - Flume
Flume HDFS Sink
Kafka Cluster
Topic
Partition A
Partition B
Partition C
Sink
Sink
Sink
HDFS
Flume SolR Sink
Sink
Sink
Sink
SolR
Flume HBase Sink
Sink
Sink
Sink
HBase

More Related Content

What's hot

Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
Michael Rys
 
How to use abap cds for data provisioning in bw
How to use abap cds for data provisioning in bwHow to use abap cds for data provisioning in bw
How to use abap cds for data provisioning in bw
Luc Vanrobays
 
Migrating database content from sql server to sap hana
Migrating database content from sql server to sap hanaMigrating database content from sql server to sap hana
Migrating database content from sql server to sap hana
venu212
 
AWS October Webinar Series - Introducing Amazon QuickSight
AWS October Webinar Series - Introducing Amazon QuickSightAWS October Webinar Series - Introducing Amazon QuickSight
AWS October Webinar Series - Introducing Amazon QuickSight
Amazon Web Services
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Amazon Web Services
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
ABD317_Building Your First Big Data Application on AWS - ABD317
ABD317_Building Your First Big Data Application on AWS - ABD317ABD317_Building Your First Big Data Application on AWS - ABD317
ABD317_Building Your First Big Data Application on AWS - ABD317
Amazon Web Services
 
Serverless data and analytics on AWS for operations
Serverless data and analytics on AWS for operations Serverless data and analytics on AWS for operations
Serverless data and analytics on AWS for operations
CloudHesive
 
SAP Cloud Platform Integration Services – L1 Deck
SAP Cloud Platform Integration Services – L1 DeckSAP Cloud Platform Integration Services – L1 Deck
SAP Cloud Platform Integration Services – L1 Deck
SAP Cloud Platform
 
AIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesAIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed Traces
Jorge Cardoso
 
Quick Help in Vistex Technical
Quick Help in Vistex TechnicalQuick Help in Vistex Technical
Quick Help in Vistex Technical
SAPYard
 
Various Table Partitioning in SAP HANA
Various Table Partitioning in SAP HANAVarious Table Partitioning in SAP HANA
Various Table Partitioning in SAP HANA
Debajit Banerjee
 
20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast
Amazon Web Services Japan
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
Lam Le
 
How to Bring Suppliers to the Ariba Network
How to Bring Suppliers to the Ariba NetworkHow to Bring Suppliers to the Ariba Network
How to Bring Suppliers to the Ariba Network
SAP Ariba
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
SAP S/4HANA on AWS Tシャツモデル
SAP S/4HANA on AWS TシャツモデルSAP S/4HANA on AWS Tシャツモデル
SAP S/4HANA on AWS Tシャツモデル
Tetsuya Kawahara
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
Amazon Web Services Korea
 
AWS Data Analytics on AWS
AWS Data Analytics on AWSAWS Data Analytics on AWS
AWS Data Analytics on AWS
sampath439572
 

What's hot (20)

Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
How to use abap cds for data provisioning in bw
How to use abap cds for data provisioning in bwHow to use abap cds for data provisioning in bw
How to use abap cds for data provisioning in bw
 
Migrating database content from sql server to sap hana
Migrating database content from sql server to sap hanaMigrating database content from sql server to sap hana
Migrating database content from sql server to sap hana
 
AWS October Webinar Series - Introducing Amazon QuickSight
AWS October Webinar Series - Introducing Amazon QuickSightAWS October Webinar Series - Introducing Amazon QuickSight
AWS October Webinar Series - Introducing Amazon QuickSight
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
ABD317_Building Your First Big Data Application on AWS - ABD317
ABD317_Building Your First Big Data Application on AWS - ABD317ABD317_Building Your First Big Data Application on AWS - ABD317
ABD317_Building Your First Big Data Application on AWS - ABD317
 
Serverless data and analytics on AWS for operations
Serverless data and analytics on AWS for operations Serverless data and analytics on AWS for operations
Serverless data and analytics on AWS for operations
 
SAP Cloud Platform Integration Services – L1 Deck
SAP Cloud Platform Integration Services – L1 DeckSAP Cloud Platform Integration Services – L1 Deck
SAP Cloud Platform Integration Services – L1 Deck
 
AIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesAIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed Traces
 
Quick Help in Vistex Technical
Quick Help in Vistex TechnicalQuick Help in Vistex Technical
Quick Help in Vistex Technical
 
Various Table Partitioning in SAP HANA
Various Table Partitioning in SAP HANAVarious Table Partitioning in SAP HANA
Various Table Partitioning in SAP HANA
 
20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
 
How to Bring Suppliers to the Ariba Network
How to Bring Suppliers to the Ariba NetworkHow to Bring Suppliers to the Ariba Network
How to Bring Suppliers to the Ariba Network
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
SAP S/4HANA on AWS Tシャツモデル
SAP S/4HANA on AWS TシャツモデルSAP S/4HANA on AWS Tシャツモデル
SAP S/4HANA on AWS Tシャツモデル
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
 
AWS Data Analytics on AWS
AWS Data Analytics on AWSAWS Data Analytics on AWS
AWS Data Analytics on AWS
 
Easy dms basic process guide
Easy dms basic process guideEasy dms basic process guide
Easy dms basic process guide
 

Viewers also liked

Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoop
Anu Ravindranath
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
Gwen (Chen) Shapira
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
hadooparchbook
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
hadooparchbook
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Fraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsFraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive Analytics
Alejandro Correa Bahnsen, PhD
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Behavioral Analytics for Preventing Fraud Today and Tomorrow
Behavioral Analytics for Preventing Fraud Today and TomorrowBehavioral Analytics for Preventing Fraud Today and Tomorrow
Behavioral Analytics for Preventing Fraud Today and Tomorrow
Guardian Analytics
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Sudarson Roy Pratihar
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
Anju Singh
 
1609 Fraud Data Science
1609 Fraud Data Science1609 Fraud Data Science
1609 Fraud Data Science
Alejandro Correa Bahnsen, PhD
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detection
Mk Kim
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Impetus Technologies
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
Gwen (Chen) Shapira
 
Anomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) EnglishAnomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) English
Adam Gibson
 
Risk Analysis using open FAIR and Adoption of right Security Controls
Risk Analysis using open FAIR and Adoption of right Security ControlsRisk Analysis using open FAIR and Adoption of right Security Controls
Risk Analysis using open FAIR and Adoption of right Security Controls
Priyanka Aash
 
Network Forensics and Practical Packet Analysis
Network Forensics and Practical Packet AnalysisNetwork Forensics and Practical Packet Analysis
Network Forensics and Practical Packet Analysis
Priyanka Aash
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Practical Applications of Block Chain Technologies
Practical Applications of Block Chain Technologies Practical Applications of Block Chain Technologies
Practical Applications of Block Chain Technologies
Priyanka Aash
 

Viewers also liked (20)

Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoop
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Fraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsFraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive Analytics
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Behavioral Analytics for Preventing Fraud Today and Tomorrow
Behavioral Analytics for Preventing Fraud Today and TomorrowBehavioral Analytics for Preventing Fraud Today and Tomorrow
Behavioral Analytics for Preventing Fraud Today and Tomorrow
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
1609 Fraud Data Science
1609 Fraud Data Science1609 Fraud Data Science
1609 Fraud Data Science
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detection
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Anomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) EnglishAnomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) English
 
Risk Analysis using open FAIR and Adoption of right Security Controls
Risk Analysis using open FAIR and Adoption of right Security ControlsRisk Analysis using open FAIR and Adoption of right Security Controls
Risk Analysis using open FAIR and Adoption of right Security Controls
 
Network Forensics and Practical Packet Analysis
Network Forensics and Practical Packet AnalysisNetwork Forensics and Practical Packet Analysis
Network Forensics and Practical Packet Analysis
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Practical Applications of Block Chain Technologies
Practical Applications of Block Chain Technologies Practical Applications of Block Chain Technologies
Practical Applications of Block Chain Technologies
 

Similar to Fraud Detection with Hadoop

Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
hadooparchbook
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
SnappyData
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
hadooparchbook
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
Jonathan Seidman
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
hadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
hadooparchbook
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
Dr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
Dr. Mirko Kämpf
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
DataWorks Summit
 
Observability in real time at scale
Observability in real time at scaleObservability in real time at scale
Observability in real time at scale
Balvinder Hira
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
Jonathan Seidman
 

Similar to Fraud Detection with Hadoop (20)

Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Observability in real time at scale
Observability in real time at scaleObservability in real time at scale
Observability in real time at scale
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
markgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
markgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
markgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
markgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
markgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
markgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
markgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
markgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
markgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
markgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
markgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
markgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
markgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
markgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
markgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
markgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
markgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 

Recently uploaded

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
Kamal Acharya
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
PrashantGoswami42
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Kamal Acharya
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 

Recently uploaded (20)

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 

Fraud Detection with Hadoop