Streaming Patterns Revolutionary Architectures with the Kafka API

© 2016 MapR Technologies
Carol McDonald

Agenda
Streams Core Components
•  Topics, Partitions
•  Fault Tolerance
•  High Availability
Patterns
•  Event Sourcing
•  Duality of Streams and Databases
•  Command Query Responsibility Segregation
•  Polyglot Persistence, Multiple Materialized Views
•  Turning the Database Upside Down
Real World Examples
•  Fraud Detection
•  Healthcare Exchange

Which products are we discussing?

Streams Core Components

What’s a Stream?
A stream is an unbounded sequence of events carried
from a set of producers to a set of consumers.
Producers -> Stream (Events) -> Consumers

What is Streaming Data? Got Some Examples?
Data Collection
Devices
Smart Machinery Phones and Tablets Home Automation
RFID Systems Digital Signage Security Systems Medical Devices

Why Streams?
Many Big Data sources are Event Oriented
Trigger Events:
•  Stock Prices
•  User Activity
•  Sensor Data
Event Data -> Stream (Topics) -> Real-Time Analytics

Analyze Data
What if you need to analyze data as it arrives?

Batch Processing with HDFS
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Analyze: 90°
It was hot at 6:05 yesterday!

Event Processing with Streams
Stream Topic: Temperature
6:05 P.M.: 90°
Turn on the air conditioning!

Organize Data
What if you need to organize data as it arrives?

Integrating Many Data Sources and Applications
Sources
(Producers)
Applications
(Consumers)
Unorganized, Complicated, and Tightly Coupled.

Organize Data into Topics with MapR Streams
Topics Organize Events into Categories and Decouple Producers from Consumers
Consumers
MapR Cluster
Topic: Pressure
Topic: Temperature
Topic: Warnings
Consumers
Consumers
Kafka API Kafka API

Process High Volume of Data
What if you need to process a high volume of data as it arrives?

What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!

Legacy Messaging
Millions of Sources -> Legacy Message Queue -> Hundreds of Destinations
Message rate: <100K/s
Publish: insert, Acks
Consume: Acks, delete

Mechanisms for Decoupling
Traditional message queues?
•  Huge performance hit for persistence:
–  Message acknowledgement per message, per consumer
–  Lots of non-sequential disk I/O when messages are added or removed

Scalable Messaging with MapR Streams
Server 1
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Server 2
Server 3
Topics are partitioned for throughput and scalability

Producers are load balanced between partitions
Kafka API

Consumers
Consumers
Consumers
Consumer groups can read in parallel
Kafka API

Core Components: Partitions
Partitions:
–  Messages are appended in order
Offset:
–  Sequential id of a message in a partition
MapR Cluster, Topic: Admission / Server 1
Partition 1: 6 5 4 3 2 1
Partition 2: 3 2 1
Partition 3: 5 4 3 2 1
(6 = New Message, 1 = Old Message)
Producers, Consumers
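
The partition and offset idea can be sketched with a plain in-memory list. This is only an illustration, not how MapR Streams stores data: the class name PartitionLog, its methods, and the choice to number offsets from 1 (to match the slide) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of one partition: an append-only list in which the
// offset of a message is its position in arrival order.
// Offsets start at 1 to match the slide's numbering (an assumption).
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Append to the tail and return the offset assigned to the message.
    long append(String message) {
        messages.add(message);
        return messages.size();
    }

    // Read the message stored at a given 1-based offset.
    String read(long offset) {
        return messages.get((int) (offset - 1));
    }

    // Offset of the newest message, or 0 if the partition is empty.
    long latestOffset() {
        return messages.size();
    }

    public static void main(String[] args) {
        PartitionLog partition = new PartitionLog();
        for (int i = 1; i <= 6; i++) {
            long offset = partition.append("event-" + i);
            System.out.println("appended event-" + i + " at offset " + offset);
        }
        System.out.println("oldest: " + partition.read(1));
        System.out.println("newest: " + partition.read(partition.latestOffset()));
    }
}
```

Because messages are only ever added at the tail, an offset never changes once assigned, which is what lets consumers remember their position as a single number.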

Read Cursors
•  Read cursor: offset ID of most recent read message
•  Producers append new messages to tail
•  Consumers read from head
MapR Cluster: 6 5 4 3 2 1
Producers
Consumer groups, each with a read cursor

Events are delivered in the order they are received, like a queue.
Partitioned, Sequential Access = High Performance
MapR Cluster
Partition 1: 6 5 4 3 2 1
Partition 2: 3 2 1
Partition 3: 5 4 3 2 1
(6 = New Message, 1 = Old Message)
Producers, Consumers

Unlike a queue, events are persisted even after they’re delivered
Messages remain on the partition, available to other consumers
Minimizes non-sequential disk reads and writes
MapR Cluster (1 Server)
Topic: Warning, Partition 1: 3 2 1 (Unread Events)
Consumer -> Poll -> Client Library -> Get Unread -> 3 2 1
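
A rough sketch of read cursors and persisted events, assuming one cursor per consumer group over a single in-memory partition. The class CursorDemo and its poll method are hypothetical stand-ins for illustration; they are not the Kafka or MapR client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: events stay in the partition after delivery, and each consumer
// group tracks its own position with a read cursor.
class CursorDemo {
    private final List<String> partition = new ArrayList<>();      // persisted events
    private final Map<String, Integer> cursors = new HashMap<>();  // group -> number of events already read

    void append(String event) {
        partition.add(event);
    }

    // Return the events this group has not read yet and advance its cursor.
    // Nothing is removed from the partition.
    List<String> poll(String group) {
        int from = cursors.getOrDefault(group, 0);
        List<String> unread = new ArrayList<>(partition.subList(from, partition.size()));
        cursors.put(group, partition.size());
        return unread;
    }

    public static void main(String[] args) {
        CursorDemo log = new CursorDemo();
        log.append("warning-1");
        log.append("warning-2");
        log.append("warning-3");
        System.out.println("group A: " + log.poll("A"));
        System.out.println("group A again: " + log.poll("A"));
        System.out.println("group B: " + log.poll("B"));
    }
}
```

Group B still receives all three events after group A has read them, which is the difference from a delete-on-consume queue.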

Considering a Messaging Platform
Kafka-esque Logs?
•  Sequential disk writes and reads:
–  Messages are persisted sequentially as produced, and read sequentially when consumed
•  Performance plus persistence:
–  Up to a billion messages per second at millisecond-level delivery times
Kafka model is BLAZING fast
•  Kafka 0.9 API with message sizes at 200 bytes
•  MapR Streams on a 5 node cluster sustained 18 million events / sec
•  Throughput of 3.5GB/s and over 1.5 trillion events / day
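
The benchmark figures above can be cross-checked with simple arithmetic. The sketch below assumes the 200 bytes is payload only and that the rate is sustained for a full day; the class and method names are made up for illustration.

```java
// Back-of-the-envelope check of the quoted benchmark numbers:
// 18 million events/sec at 200 bytes per message.
class ThroughputCheck {
    static long bytesPerSecond(long eventsPerSecond, long bytesPerEvent) {
        return eventsPerSecond * bytesPerEvent;
    }

    static long eventsPerDay(long eventsPerSecond) {
        return eventsPerSecond * 60L * 60L * 24L;
    }

    public static void main(String[] args) {
        long eventsPerSecond = 18_000_000L;
        long bytesPerEvent = 200L;
        System.out.println("bytes per second: " + bytesPerSecond(eventsPerSecond, bytesPerEvent));
        System.out.println("events per day:   " + eventsPerDay(eventsPerSecond));
    }
}
```

This gives 3.6 billion bytes per second and about 1.56 trillion events per day, in line with the slide's "3.5GB/s" and "over 1.5 trillion events / day". The small gap in the GB/s figure may come down to rounding or to how a gigabyte is counted.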

When Are Messages Deleted?
•  Messages can be persisted forever
Or
•  Older messages can be deleted automatically based on time to live
MapR Cluster (1 Server)
Partition 1: 6 5 4 3 2 1 (1 = Older message)
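
A minimal sketch of time-to-live retention, assuming each message carries the timestamp at which it was appended. TtlPartition and its methods are hypothetical, and treating a TTL of 0 or less as "persist forever" is a convention of this sketch, not a documented MapR setting.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// In-memory sketch of time-to-live retention on one partition.
// A ttlMillis of 0 or less stands for "persist forever" (an assumption).
class TtlPartition {
    private static final class Entry {
        final long timestampMillis;
        final String message;

        Entry(long timestampMillis, String message) {
            this.timestampMillis = timestampMillis;
            this.message = message;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final long ttlMillis;

    TtlPartition(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void append(long nowMillis, String message) {
        entries.addLast(new Entry(nowMillis, message));
    }

    // Remove messages older than the TTL, oldest first; returns how many were dropped.
    int expire(long nowMillis) {
        if (ttlMillis <= 0) {
            return 0;
        }
        int dropped = 0;
        while (!entries.isEmpty() && nowMillis - entries.peekFirst().timestampMillis > ttlMillis) {
            entries.removeFirst();
            dropped++;
        }
        return dropped;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        TtlPartition partition = new TtlPartition(1000);
        partition.append(0, "old");
        partition.append(900, "recent");
        System.out.println("dropped " + partition.expire(1500) + ", kept " + partition.size());
    }
}
```

Since messages are appended in time order, expiry only ever trims from the old end of the partition.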

Parallelism When Reading
To read messages from the same Topic in parallel:
•  create consumer groups
•  consumers with same group.id
•  partitions assigned dynamically round-robin
Consumer group: Oil Wells
Consumer A
Consumer B
Consumer C
MapR Cluster
Partition 4: Warning

Fault Tolerance Consumption: Partitions Re-Assigned Dynamically
If a consumer goes offline, its partitions are re-assigned
Consumer group.id: Oil Wells
Consumer A
Consumer C
MapR Cluster
Partition4: Warning
Partition3: Warning
Partition2: Warning
Partition1: Warning
Partition5: Warning
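
The round-robin assignment and the re-assignment after a consumer drops out can be sketched as follows. This is a simplified illustration only: the real assignment is done by the client library and cluster, and the class RoundRobinAssignment, its assign method, and the exact partition order are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: spread the partitions of one topic across the consumers of a
// group in round-robin order, then recompute when a consumer goes offline.
class RoundRobinAssignment {
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;  // nobody to assign to
        }
        for (int p = 0; p < partitionCount; p++) {
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p + 1);  // partitions numbered from 1, as on the slide
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three consumers in the "Oil Wells" group share five partitions.
        System.out.println(assign(Arrays.asList("A", "B", "C"), 5));
        // Consumer B goes offline: the same five partitions are split over A and C.
        System.out.println(assign(Arrays.asList("A", "C"), 5));
    }
}
```

Each partition has exactly one owner within the group at any time, so the group reads in parallel without two members processing the same message.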

Processing Same Message for Different Views
Consumers
Consumers
Consumers
Producers
Producers
Producers
MapR-FS
Kafka API Kafka API
Pub Sub: Multiple Consumers, Multiple Destinations

Partition Fault Tolerance

Message Recovery
What if you need to recover messages in case of server failure?

Partitions are Replicated for Fault Tolerance
Producer
Producer
Server 2 Partition2: Topic - Warning
Producer
Server 2
Server 3
Server 1
Server 3
Server 1
Server 2

Partition1: Warning
Partition2: Warning Replica
Partition3: Warning
Producer
Producer
Producer
Server 1
Server 2
Server 3
Security Investigation &
Event Management
Operational
Intelligence
Real-time Analytics
Partition2: Warning

Producer
Producer
Producer
Event Management
Operational
Intelligence
Real-time Analytics
Partition1: Warning
Partition3: Warning
Server 1
Server 2
Server 3
Partition2: Warning

Partitions are Replicated for Fault tolerance
Producer
Producer
Producer
Event Management
Operational
Intelligence
Real-time Analytics
Partition1: Warning
Partition3: Warning
Server 1
Server 2
Server 3
Partition2: Warning

Streams and High Availability

Core Components: Streams
•  Stream:
–  collection of topics managed together
•  Manage stream:
–  replication
–  security
–  time-to-live
–  number of partitions
Stream (Pressure, Temperature, Warning) -> Replication -> Stream (Pressure, Temperature, Warning)
Producers, Consumers

Real-time Access
What if you need real-time access to live data distributed across multiple clusters
and multiple data centers?

Lack of Global Replication
Topic: C

Streams and Replication
Streams:
•  are a collection of topics
•  can be replicated worldwide
Topic: A
Topic: B
Topic: C
Topic: A
Topic: B
Topic: C
Replicating to
another
cluster

Streams and Replication
Topic: A
Topic: B
Topic: C
Fail Over
Streams:
•  high availability
•  disaster recovery

Replicating Streams: Master-Slave Replication
Venezuela_HA
Cluster
Metrics Stream
Metrics
Producers
Venezuela
Cluster
Metrics Stream
Metrics
Consumers
High Availability: Backup for Venezuela
Master Slave

Replicating Streams: Many-to-One Replication
Houston: Metrics Stream (topic: Metrics), Consumers
Venezuela: Metrics Stream (topic: Metrics), Producers, Consumers
Mexico: Metrics Stream (topic: Metrics), Producers, Consumers
Many -> One: Analyze all data from Houston

Replicating Streams: Multi-Master Replication
Seoul: Metrics Stream (topic: Metrics), Producers, Consumers
San Francisco: Metrics Stream (topic: Metrics), Producers, Consumers
Both send and receive updates

Stream Replication
WAN
Stream
Pressure
Temperature
Warning
Stream
Pressure
Temperature
Warning
Stream
Pressure
Temperature
Warning

Ship picks up containers…
Singapore

Arrives at destination…
Tokyo

While enroute to next destination…
Washington

Where does the data live…
Singapore Washington
Tokyo

What is important about this?
Data is generated on the ship
•  Must have an easy (i.e., foolproof) way to move the data off the ship
Each port stores the data from the ship
•  Moving data between locations
•  Analytics could happen at any location
This is a multi-data center time series data use case
•  Events from sensors = metrics
•  Same concepts as data center monitoring

Patterns

Event Sourcing
Imagine each event as a change to an entry in a database.
Change log (credit, debit events), applied as Updates:
1: WillO : Deposit : 100.00
2: BradA : Deposit : 50.00
3: BradA : Withdraw : 30.00
4: WillO : Withdraw : 20.00
Current account balances:
Account Id   Balance
WillO        80.00
BradA        20.00
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
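
As a sketch of how the change log yields the balance table, the four events on this slide can be replayed in order. The class ChangeLogReplay, its Event type, and the use of BigDecimal are illustrative choices and not part of the original talk.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Event sourcing sketch: rebuild current balances by replaying the change log.
class ChangeLogReplay {
    static final class Event {
        final String account;
        final String type;        // "Deposit" or "Withdraw"
        final BigDecimal amount;

        Event(String account, String type, String amount) {
            this.account = account;
            this.type = type;
            this.amount = new BigDecimal(amount);
        }
    }

    // Apply every event in log order; deposits add, withdrawals subtract.
    static Map<String, BigDecimal> replay(List<Event> log) {
        Map<String, BigDecimal> balances = new LinkedHashMap<>();
        for (Event e : log) {
            BigDecimal delta = e.type.equals("Deposit") ? e.amount : e.amount.negate();
            balances.merge(e.account, delta, BigDecimal::add);
        }
        return balances;
    }

    public static void main(String[] args) {
        List<Event> log = Arrays.asList(
            new Event("WillO", "Deposit", "100.00"),
            new Event("BradA", "Deposit", "50.00"),
            new Event("BradA", "Withdraw", "30.00"),
            new Event("WillO", "Withdraw", "20.00"));
        System.out.println(replay(log));  // prints {WillO=80.00, BradA=20.00}
    }
}
```

The balances are derived data: they can be thrown away and recomputed from the log at any time, while the reverse is not possible.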

Replication
Duality of Streams and Tables:
Database: captures data at rest
Stream: captures data change
Change Log: 3 2 1
Master: Append writes
Slave: Apply writes in order
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Which Makes a Better System of Record?
Which of these can be used to reconstruct the other?
Account Id Balance
WillO 80.00
BradA 20.00
Change Log
3 2 1

Rewind: Reprocessing Events
MapR Cluster: 6 5 4 3 2 1
Producers
Consumer: Reprocess from oldest message
Create new view, index, cache

Rewind: Reprocessing Events
MapR Cluster: 6 5 4 3 2 1
Producers
Consumer: To newest message -> new view
Read from new view

Event Sourcing, Command Query Responsibility Segregation:
Turning the Database Upside Down
Events -> Updates -> Key-Val, Document, Graph, Wide Column, Time Series, Relational, ???

What Else Do I Use My Stream For?
Lineage - “how did BradA’s balance get so low?”
Auditing - “who deposited/withdrew from BradA’s account?”
History – to see the status of the accounts last year
Integrity - “can I trust this data hasn’t been tampered with?”
•  Yup - Streams are immutable

What Do I Need For This to Work?
Infinitely persisted events
A way to query your persisted stream data
An integrated security model across the stream and databases

Fraud Detection
Point of Sale -> Data Center: Is the Transaction Fraud?
•  Lots of requests
•  Need answer within ~50-100 milliseconds
Data
Center
Point of Sale
Location, time, card#
Fraud yes/no ?

Traditional Solution
POS 1..n -> Fraud detector <-> Last card use
1.  Look up last card use
2.  Compute the card velocity:
•  Subtract last location, time from current location, time
3.  Update last card use
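
One way to read "card velocity" is distance travelled divided by elapsed time between two uses of the same card. The sketch below assumes locations are latitude/longitude pairs and uses the haversine formula for distance. The class name, the km/h unit, and the idea of a caller-supplied speed threshold are all assumptions rather than details from the talk.

```java
// Sketch of the card velocity check: how fast would the card holder have had
// to travel between the last card use and the current one?
class CardVelocity {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two latitude/longitude points (haversine formula).
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Velocity implied by two card uses, in km/h. Times are epoch milliseconds.
    static double velocityKmh(double lastLat, double lastLon, long lastTimeMs,
                              double lat, double lon, long timeMs) {
        double km = distanceKm(lastLat, lastLon, lat, lon);
        double hours = (timeMs - lastTimeMs) / 3_600_000.0;
        if (hours <= 0) {
            // No time elapsed: only suspicious if the location changed.
            return km > 0 ? Double.POSITIVE_INFINITY : 0.0;
        }
        return km / hours;
    }

    // maxKmh is a hypothetical threshold chosen by the caller.
    static boolean looksFraudulent(double velocityKmh, double maxKmh) {
        return velocityKmh > maxKmh;
    }

    public static void main(String[] args) {
        // Same card used one degree of longitude apart on the equator, one minute apart.
        double v = velocityKmh(0.0, 0.0, 0L, 0.0, 1.0, 60_000L);
        System.out.println("velocity km/h: " + v);
        System.out.println("fraud? " + looksFraudulent(v, 1000.0));
    }
}
```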

What Happens Next?
POS 1..n -> Fraud detector (x3) <-> Last card use
1.  Look up last card use
2.  Compute the card velocity
3.  Update last card use
Bottleneck!

Service Isolation: Separate Read from Write
POS
1..n
Fraud
detector
Last card
use
Updater
card activity
Read
Read last card use

Separate Read Model from the Write Model:
Command Query Responsibility Segregation
POS
1..n
Fraud
detector
Last card
use
Updater
card activity
Read
Event last card use
Write last card use

Event Sourcing: New Uses of Data
Processing Same Message for Multiple Views
POS
1..n
Fraud
detector
Last card
use
Updater
Card
location
history
Other
card activity

Scaling Through Isolation allows Multiple Consumers
POS
1..n
Last card
use
Updater
POS
1..n
Last card
use
Updater
card activity
Fraud
detector
Fraud
detector
Multiple fraud detectors can use the same message queue
•  De-coupling and isolation are key
•  Propagate events, not table updates

Decoupled Architecture
Producer
Activity Handler
Producer
Producer
Historical
Interesting
Data Real-time
Analysis
Results Dashboard
Anomaly
Detection
More than one component can make use of the same stream of messages for a variety of uses

Lessons
De-coupling and isolation are key
Propagate events, not table updates

Building Enterprise Software vs Internet Companies
Enterprise Software:
Complexity of domain => Business logic, Business rules
Banking, Healthcare, Telecom
Compliance => Security
Internet Companies:
Volume of data => Complex data infrastructure
Large Scale Availability, Recovery
Reference: Martin Kleppmann

Building Enterprise Software vs Internet Companies
Enterprise Software:
Event Sourcing
Internet Companies:
Stream Processing
Reference Martin Kleppmann

Real World Solution

Credit Card Fraud Model Building

Fraud Stream Processing Architecture
Source -> Data Ingest -> Stream Processing -> NoSQL Storage -> Serve
MapR-FS
MapR-DB
Topic: A
Topic: B
Topic: C
Topic: A
Topic: B
Topic: C

Streams
Messaging
Fraud Processing
Stream Processing
Derive
features
Model
raw
enriched
alerts
process
Batch Processing
MapR-FS
MapR-DB
MapR-DB
raw
enriched
alerts
Model
build model
update model

Streams
Messaging
Fraud Event Processing
Stream
Processing
NoSQL
Storage
MapR-FS
MapR-DB
Raw
Enriched
Fraud
1.  Parse raw event
2.  Read card holder profile from MapR-DB
3.  Derive features
4.  Get prediction from model with features
5.  Publish not fraud to enriched topic
6.  Publish fraud to fraud topic
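
The six steps above can be sketched end to end with in-memory stand-ins. Everything here is hypothetical: the "card,amount" event format, the map standing in for MapR-DB profiles, the lists standing in for topics, the amount-ratio feature, and the Predicate standing in for the trained model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of the fraud event processing steps with in-memory stand-ins.
class FraudRouter {
    final Map<String, Double> profiles = new HashMap<>();  // stand-in for MapR-DB: card -> typical amount
    final List<String> enrichedTopic = new ArrayList<>();  // stand-in for the enriched topic
    final List<String> fraudTopic = new ArrayList<>();     // stand-in for the fraud topic
    private final Predicate<double[]> model;               // hypothetical model: features -> is fraud

    FraudRouter(Predicate<double[]> model) {
        this.model = model;
    }

    // rawEvent is assumed to look like "card,amount".
    void process(String rawEvent) {
        String[] fields = rawEvent.split(",");                  // 1. parse raw event
        String card = fields[0];
        double amount = Double.parseDouble(fields[1]);
        double typical = profiles.getOrDefault(card, amount);   // 2. read card holder profile
        double[] features = { amount, amount / typical };       // 3. derive features
        boolean fraud = model.test(features);                   // 4. get prediction from model
        String enriched = rawEvent + ",ratio=" + features[1];
        if (fraud) {
            fraudTopic.add(enriched);                           // 6. publish fraud
        } else {
            enrichedTopic.add(enriched);                        // 5. publish not fraud
        }
    }

    public static void main(String[] args) {
        // Toy model: flag anything more than five times the card's typical amount.
        FraudRouter router = new FraudRouter(features -> features[1] > 5.0);
        router.profiles.put("card-1", 20.0);
        router.process("card-1,25.0");
        router.process("card-1,500.0");
        System.out.println("enriched: " + router.enrichedTopic);
        System.out.println("fraud:    " + router.fraudTopic);
    }
}
```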

Fraud Processing Same Message for Different Views
Partition1: Topic – Raw Trans
Partition1: Topic – Enriched
Partition1: Topic – Fraud Alert
Partition2: Topic - Enriched
Partition3: Topic - Enriched
Consumers
MapR-FS
MapR-DB
Consumers
Consumers
Consumers
MapR-FS
MapR-DB
Consumers
Consumers
Consumers
MapR-FS
MapR-DB
Consumers
Consumers

Real World Solution

JSON DB (MapR-DB)
Graph DB (Titan on MapR-DB)
Search Engine (Elasticsearch)
Transforming the Health Care Ecosystem
Electronic Medical Records
“The Stream is the System of Record”
–Brad Anderson, VP Big Data Informatics

Liaison ALLOY™ Platform
Data Integration
ingest, transform, syndicate
Data Management
master
deduplicate
harmonize
relate
merge
tokenize
store / persist
analyze
summarize
report
distill
recommend
explore
query
sandbox
batch transform
learn
traverse

Use Case: Streaming System of Record for Healthcare
Objective:
•  Build a flexible, secure
healthcare exchange
Records Analysis
Applications
Challenges:
•  Many different data models
•  Security and privacy issues
•  HIPAA compliance
Records

ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Analytics queries like:
What are the outcomes in the entire state on diabetes?
Are there doctors that are doing this better than others?
Clinical Data
Financial Data
Provider
Organizations

2,000+ Practices, 200+ Labs, 30,000+ Clinicians
OrdersAnywhere
PORTAL (no EHR)
EHR with
HL7 ONLY
EHR with WORKFLOW
INTEGRATION
RADIOLOGY
LAB

This is a PAIN!
COMPLIANCE
SECURITY CONTROLS
COMPLIANCE
FEATURES
PRIVACY
PCI DSS 3.0
21 CFR Part 11
SSAE16 / SOC2
HIPAA/HITECH

WHY NOW?
http://bit.ly/29aBatK

WHY NOW?
2014 FQ4 profit
$ -440 M
Total Cost Estimate
$ -12 B

Why Now? The Relational Database Is Not the Only Tool
Context and Relationships

Patient 1234
Attribute     Value
patient_id    1234
Name          Jon Smith
Age           50

Patient 999
Attribute     Value
patient_id    999
Name          Jonathan Smith
DOB           Jun 1965

Provider 86
Attribute     Value
provider_id   86
Name          Dr. Nora Paige
Specialty     Diabetes

Prescription 9876
Attribute     Value
rx_id         9876
Name          Sitagliptin
Dosage        325mg

Relationships: Visited, Prescribed, WasPrescribed

WHY NOW? Mind the Gap

Streaming System of Record for Healthcare
Stream
Topic
Records
Applications
6 5 4 3 2 1
Search
Graph DB
JSON
HBase
Micro Services (x6)
API
Streaming System of Record -> Materialized Views


Immutable Log
App, API, pre-processor
Document Log (MapR-FS): raw data log
Workflows, each producing a materialized view:
•  Key/Value (MapR-DB)
•  Search Engine
•  Graph (ArangoDB)
•  Time Series (OpenTSDB)
CEP
Micro services (x8)
Apps
The Promised Land
Compliance
Auditor

The Promised Land
Auditor smiley faces
•  Data Lineage
•  Audit Logging
•  Wire-level encryption
•  At Rest encryption
Replication
•  Disaster Recovery
•  EU – data can’t leave
Non-Stream / Non-”Big Data”
•  Software Development Lifecycle
•  System Hardening
•  Separation of Concerns
-  Dev vs Ops
•  Patch Management
Compliance
Auditor

Solution
Design/architecture solved some
•  Streams
•  Data Lineage/System of Record
•  Kappa Architecture (Kreps/Kleppmann)
MapR solved others
•  Unified Security
•  Replication DC to DC
•  Converge Kafka/HBase/Hadoop to one cluster
•  Multi-tenancy (lots of topics, for lots of tenants)

API

Sample Producer: All Together
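import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SampleProducer {
    // MapR Streams topic name: stream path, a colon, then the topic
    public static final String topic = "/streams/pump:warning";
    public static KafkaProducer<String, String> producer;
    public static void main(String[] args) {
        producer = setUpProducer();
        for (int i = 0; i < 3; i++) {
            String txt = "msg " + i;
            ProducerRecord<String, String> rec =
                new ProducerRecord<String, String>(topic, txt);
            producer.send(rec);  // asynchronous send
            System.out.println("Sent msg number " + i);
        }
        producer.close();  // waits for pending sends to complete
    }
    // Assumed helper (not shown on the slide). Plain Apache Kafka also needs bootstrap.servers.
    private static KafkaProducer<String, String> setUpProducer() {
        Properties props = new Properties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<String, String>(props);
    }
}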

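Sample Consumer: All Together
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MyConsumer {
    // Must match the topic the producer writes to
    public static final String topic = "/streams/pump:warning";
    public static final long pollTimeOut = 1000;  // milliseconds; assumed value, not given on the slide
    public static KafkaConsumer<String, String> consumer;
    public static void main(String[] args) {
        configureConsumer(args);
        consumer.subscribe(Arrays.asList(topic));  // the Kafka 0.9 API subscribes to a list of topics
        try {
            while (true) {
                ConsumerRecords<String, String> msg = consumer.poll(pollTimeOut);
                for (ConsumerRecord<String, String> record : msg) {
                    System.out.println("read " + record.toString());
                }
            }
        } finally {
            consumer.close();  // release the consumer if the loop exits with an exception
        }
    }
    // Assumed helper (not shown on the slide); the group.id value is illustrative.
    private static void configureConsumer(String[] args) {
        Properties props = new Properties();
        props.put("group.id", "sample-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumer = new KafkaConsumer<String, String>(props);
    }
}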

Summary

Can we get “Extreme”?
Extreme:
1+ Trillion Events
•  per day
Millions of Producers
•  Billions of events per second
Multiple Consumers
•  Potentially for every event
Multiple Data Centers
•  Plan for success
•  Plan for drastic failure
Average Reality:
Think that is crazy? Consider having 100 servers and performing:
Monitoring and Application logs…
•  100 metrics per server
•  60 samples per minute
•  50 metrics per request
•  1,000 log entries per request (abnormally small, depends on level)
•  1 million requests per day
~ 2 billion events per day, for one small(ish) use case
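
The "~ 2 billion events per day" estimate can be reproduced from the listed figures. The sketch assumes that each of the 100 metrics on each server is sampled 60 times per minute, and that every request emits both its metrics and its log entries; the class and method names are made up for illustration.

```java
// Worked arithmetic for the "average reality" estimate of events per day.
class EventsPerDay {
    // Server monitoring: servers x metrics x samples per minute x minutes per day.
    static long metricEvents(long servers, long metricsPerServer, long samplesPerMinute) {
        return servers * metricsPerServer * samplesPerMinute * 60L * 24L;
    }

    // Application events: every request emits its metrics plus its log entries.
    static long requestEvents(long requestsPerDay, long metricsPerRequest, long logEntriesPerRequest) {
        return requestsPerDay * (metricsPerRequest + logEntriesPerRequest);
    }

    public static void main(String[] args) {
        long monitoring = metricEvents(100, 100, 60);
        long requests = requestEvents(1_000_000, 50, 1_000);
        System.out.println("monitoring events/day: " + monitoring);
        System.out.println("request events/day:    " + requests);
        System.out.println("total events/day:      " + (monitoring + requests));
    }
}
```

Under those assumptions monitoring contributes 864 million events and requests contribute 1.05 billion, for a total of about 1.9 billion per day.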

Stream Processing
Building a Complete Data Architecture
MapR File System
(MapR-FS)
MapR Converged Data Platform
MapR Database
(MapR-DB)
MapR Streams
Sources/Apps Bulk Processing

Find my slides & other related materials to this talk here:
bit.ly/jjug-aug2016

MapR Blog
• https://www.mapr.com/blog/

…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
