This session focuses on solutions that require high-throughput ingestion and real-time streaming of data. You'll become familiar with business use cases and architecture examples, and with the core concepts of stream processing systems. Next, you'll get a detailed look at the functional and non-functional capabilities of the Azure Event Hub service to see how it fits into the whole picture. You'll also learn the current constraints of the service so that you can assess whether it suits your particular scenario.
2. 2
A few words about myself…
I’m Alexander Laysha
• Solution Architect at EPAM Systems & Microsoft Azure MVP
• Focused on backend, high-load and cloud solutions
• Leader of Belarus Azure Community
• Speaker at local and external meetups and conferences
My contacts
• Email: layshaalex@gmail.com
• Twitter: @layshaalexander
• Facebook: alexander.laysha
3. 3
• Business needs for real-time analytics
• Use-cases & architecture approaches
• Basics of real-time data streaming platforms
• Azure Event Hub capabilities & constraints
• Pricing calculations for multiple data ingestion scenarios based on Event Hub
• Summary
Which topics will we cover?
4. 4
In the past
• Captured data for later analysis
• Reports and analytics with X days of latency
Today
• Dealing with tons of data
• Offline reporting and analysis is no longer enough (but still important)
• Businesses want immediate insights from captured data, with X seconds/minutes of latency
Business needs for data analytics
5. 5
IoT – device operational intelligence and proactive alerts
Gaming industry – real-time board with game leaders and scores
E-commerce – online recommendation engines and proactive care
Operations – analyze real-time data to respond to dynamic environments and take immediate action
Financial – monitor financial transactions in real time to detect fraudulent activity
Just a few use cases…
6. 6
Collection – data captured from multiple sources
Streaming (OUR FOCUS) – high-throughput data pipeline systems like Kafka, Kinesis, Event Hub
Processing – stream processing platforms that perform a certain task to produce output
Serving – apps that consume the stream processing output: UI, posts, DB, report viewers, APIs
High-level real-time streaming architecture
7. 7
Persistence and batch – data is stored in a persistence layer, from which it is periodically ingested and processed by the batch layer (may include stream processing for on-the-fly ETL)
Speed layer – handles the portion of the data that has not yet been processed by the batch layer (includes stream processing and storage)
Serving layer – consolidates both by merging the output of the batch layer and the speed layer
High-level view of Lambda Architecture
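The three Lambda layers can be illustrated with a small in-memory sketch. It assumes a simple per-key count as the computation; the function names and sample events are hypothetical and are not tied to any particular product.

```python
from collections import Counter

def batch_view(persisted_events):
    """Batch layer: periodically recompute counts over all persisted events."""
    return Counter(event["key"] for event in persisted_events)

def speed_view(recent_events):
    """Speed layer: count only the events the batch layer has not processed yet."""
    return Counter(event["key"] for event in recent_events)

def serve(batch, speed):
    """Serving layer: merge the batch output with the speed-layer output."""
    return batch + speed

# Hypothetical sample data: three events already persisted, two still "in flight".
persisted = [{"key": "device-1"}, {"key": "device-2"}, {"key": "device-1"}]
recent = [{"key": "device-1"}, {"key": "device-3"}]

merged = serve(batch_view(persisted), speed_view(recent))
print(dict(merged))
```

Once the next batch run has absorbed the recent events, the speed view for them would be discarded; that bookkeeping is left out of this sketch.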
8. 8
High-level view of Kappa Architecture
Persistence – stores the initial raw data for historical purposes; it can be used to replay computations from the initial data stream
Speed layer – the basic idea is not to periodically recompute all data in a batch layer, but to do all computation in the stream processing system alone, and to recompute only when the business logic changes, by replaying historical data
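The replay idea behind Kappa can be sketched as follows. This is a minimal illustration, assuming the raw stream is retained as an ordered list; the two "business logic" functions and the event fields are hypothetical.

```python
def run_stream_job(raw_log, logic):
    """Process the retained raw stream from the beginning with the given logic."""
    state = {}
    for event in raw_log:
        logic(state, event)
    return state

def count_all_events(state, event):
    """Version 1 of the business logic: count every event per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1

def count_purchases_only(state, event):
    """Version 2 of the business logic: count only purchase events per user."""
    if event["type"] == "purchase":
        state[event["user"]] = state.get(event["user"], 0) + 1

# Hypothetical retained raw data.
raw_log = [
    {"user": "a", "type": "view"},
    {"user": "a", "type": "purchase"},
    {"user": "b", "type": "view"},
]

view_v1 = run_stream_job(raw_log, count_all_events)
# The business logic changed: replay the same historical data with the new logic.
view_v2 = run_stream_job(raw_log, count_purchases_only)
print(view_v1, view_v2)
```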
11. 11
Common terminology
Generic term     Apache Kafka     Amazon Kinesis                Azure Event Hub
Producer         Producer         Producer                      Publisher
Consumer         Consumer         Kinesis Stream Applications   Consumer
Stream           Topic            Stream                        Event Hub
Partition        Partition        Shard                         Partition
Index            Offset           Sequence Number               Offset
Consumer Group   Consumer Group   Application                   Consumer Group
12. 12
• Designed to handle very large quantities of small messages
• Horizontally scalable by using partitions and consumer groups
• Reliable and fault-tolerant
• Configurable data replication
• Configurable message TTL (stream level)
• Supports at-least-once delivery
• Logical data organization using partitions
• Separate data view for each consumer by using consumer groups and indexes
• Ability to replay messages
• Messages with the same key are sent to the same partition
• Guarantee of message order in scope of partition
• Integrated with modern stream processing platforms (Stream Analytics, Storm, Spark, etc.)
Common characteristics
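Several of these characteristics (same key to same partition, order within a partition, a separate position per consumer group, replay) can be shown with a toy in-memory log. This is only a sketch: the class, its methods and the CRC32-based hashing are assumptions for illustration and do not reproduce the internals of Kafka, Kinesis or Event Hub.

```python
import zlib
from collections import defaultdict

class PartitionedLog:
    """Toy append-only log split into partitions (illustrative only)."""

    def __init__(self, partition_count):
        self.partitions = [[] for _ in range(partition_count)]
        # (consumer group, partition) -> offset of the next event to read
        self.offsets = defaultdict(int)

    def send(self, key, value):
        """Append an event; events with the same key land in the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group, partition):
        """Return unread events for this consumer group, in append order."""
        start = self.offsets[(group, partition)]
        events = self.partitions[partition][start:]
        self.offsets[(group, partition)] = len(self.partitions[partition])
        return events

    def rewind(self, group, partition, offset=0):
        """Move the group's position back so that events can be replayed."""
        self.offsets[(group, partition)] = offset

log = PartitionedLog(partition_count=4)
p = log.send("sensor-7", "t=20")
log.send("sensor-7", "t=21")

print(log.poll("dashboard", p))   # first read by the "dashboard" group
print(log.poll("archiver", p))    # the "archiver" group has its own position
log.rewind("dashboard", p)
print(log.poll("dashboard", p))   # replay from the beginning
```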
13. 13
Let’s take a closer look at Azure Event Hub
Event producers: > 1M producers, > 1 GB/sec aggregate throughput
Partition selection: Direct or PartitionKey hash
Throughput Units:
• 1 ≤ TUs ≤ partition count
• 1 TU: 1 MB/s writes, 2 MB/s reads
Namespace
14. 14
Ways to publish - individual event or batch:
• Round Robin
• Partition Id
• Partition Key
Supported Protocols:
• HTTPS – short-lived (low throughput)
• AMQP 1.0 – long-lived (high throughput)
Publisher Policy – a run-time feature designed to facilitate large numbers of independent event publishers by giving each a unique identifier and a virtual endpoint:
//[my namespace].servicebus.windows.net/[event hub name]/publishers/[my publisher name]
Event Hub Publishers
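The three publishing modes and the publisher endpoint format can be sketched in a few lines. This is a simplified model, not the Event Hub client: the Router class is hypothetical, and the CRC32 hash only stands in for whatever hash function the service actually applies to the partition key.

```python
import itertools
import zlib

class Router:
    """Pick a target partition for an event (illustrative model only)."""

    def __init__(self, partition_count):
        self.partition_count = partition_count
        self._round_robin = itertools.cycle(range(partition_count))

    def choose(self, partition_id=None, partition_key=None):
        if partition_id is not None and partition_key is not None:
            raise ValueError("specify a partition id or a partition key, not both")
        if partition_id is not None:
            # Partition Id: the sender addresses one partition directly.
            if not 0 <= partition_id < self.partition_count:
                raise ValueError("unknown partition id")
            return partition_id
        if partition_key is not None:
            # Partition Key: hash the key so that equal keys share a partition.
            # CRC32 is an assumption made for this sketch.
            return zlib.crc32(partition_key.encode("utf-8")) % self.partition_count
        # Round Robin: no id and no key, so partitions are used in turn.
        return next(self._round_robin)

def publisher_endpoint(namespace, event_hub, publisher):
    """Build the virtual publisher endpoint in the format shown on the slide."""
    return f"//{namespace}.servicebus.windows.net/{event_hub}/publishers/{publisher}"

router = Router(partition_count=4)
print([router.choose() for _ in range(5)])
print(router.choose(partition_id=2))
print(router.choose(partition_key="device-42"))
print(publisher_endpoint("mynamespace", "myhub", "device-42"))
```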
15. 15
Event listening – a consumer connects to a partition using the AMQP 1.0 protocol and listens for incoming events
Consumer group – a view (state, position, or offset) of an entire event hub. Consumer groups enable multiple consuming applications to each have a separate view of the event stream, and to read the stream independently at their own pace and with their own offsets
Event Hub Consumers
16. 16
• The security model is based on Shared Access Signature (SAS) tokens
• A shared access policy (key) supports the following claims: Send, Listen, Manage
• A shared access policy (key):
  • can be created at the namespace or event hub level
  • includes Primary and Secondary keys
  • Primary and Secondary keys can be revoked
• SAS tokens can be created at the namespace, event hub or publisher level
• Granular control over event publishers through publisher policies (the publisher name should match the partition key, and the SAS token should be issued for the publisher endpoint)
• An event publisher can be revoked if it uses a publisher-specific SAS token
Event Hub Security
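A SAS token can be generated with the standard library alone. The sketch below follows the token layout as I understand it from the Service Bus / Event Hubs documentation (an HMAC-SHA256 signature over the URL-encoded resource URI and the expiry time); treat the exact layout as something to verify against the official docs. The namespace, hub, policy name and key are placeholders.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def generate_sas_token(resource_uri, key_name, key, expiry):
    """Build a SAS token for a namespace, event hub or publisher URI.

    expiry is a Unix timestamp (seconds) after which the token is invalid.
    """
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    string_to_sign = f"{encoded_uri}\n{expiry}".encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote(base64.b64encode(digest), safe="")
    return (
        f"SharedAccessSignature sr={encoded_uri}"
        f"&sig={signature}&se={expiry}&skn={key_name}"
    )

# Placeholder values: a publisher-level token that is valid for one hour.
publisher_uri = "https://mynamespace.servicebus.windows.net/myhub/publishers/device-42"
token = generate_sas_token(
    publisher_uri, "send-policy", "not-a-real-key", int(time.time()) + 3600
)
print(token)
```

Scoping the URI to the publisher endpoint is what makes the token publisher-specific, which is the granularity the slide refers to.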
17. 17
• Automatic persistence of ingested events from Event Hub in Apache Avro format
• Supported storages:
• Azure Storage
• Azure Data Lake
• Configurable size & time windows per partition
Event Hub Capturing
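The size and time windows can be pictured as a per-partition buffer that is flushed when either limit is reached. The sketch below assumes "whichever limit is hit first" semantics and only checks the limits when an event arrives; the class is hypothetical, and writing the Avro file is left out.

```python
class CaptureWindow:
    """Per-partition buffer flushed by size or by elapsed time (sketch only)."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = None

    def add(self, payload, now):
        """Buffer one event; return the closed batch, or None if still open."""
        if self.window_start is None:
            self.window_start = now
        self.buffer.append(payload)
        self.buffered_bytes += len(payload)
        size_reached = self.buffered_bytes >= self.max_bytes
        time_reached = now - self.window_start >= self.max_seconds
        if size_reached or time_reached:
            return self._flush()
        return None

    def _flush(self):
        # A real implementation would write the batch to storage here.
        batch, self.buffer = self.buffer, []
        self.buffered_bytes = 0
        self.window_start = None
        return batch

window = CaptureWindow(max_bytes=10, max_seconds=60)
print(window.add(b"12345", now=0))
print(window.add(b"67890", now=1))    # size window reached
print(window.add(b"a", now=100))
print(window.add(b"b", now=161))      # time window reached
```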
18. 18
Monitoring
• Integrated with Azure Monitor
• Types of diagnostic data: archive logs, operational logs, auto-scale logs, all metrics
• Diagnostic logs can be sent to: storage account, event hub, Log Analytics
Availability & Disaster Recovery
• SLA of 99.9% for operations on Event Hub
• HA is ensured by replication and availability sets
• If one partition fails, the other partitions remain available
• No built-in option for disaster recovery of Event Hub between regions (custom solution: use event capturing with geo-redundant storage and custom code to populate an Event Hub in another region)
Event Hub Monitoring & Disaster Recovery
19. 19
• Throughput Unit (TU) – the unit of scalability, shared across all event hubs in a namespace
• TUs are set manually or programmatically
• 1 TU = 1 MB/sec or 1000 events/sec on ingress, 2 MB/sec on egress; max 100 TUs for the Standard tier (contact the support team)
• Dedicated Event Hub: 1 CU = ~200 TU, max 8 CU
• Enable Auto-Inflate to scale TUs up automatically, with the ability to specify limits
• Partition count: 2-32. The count cannot be changed and must be specified at creation (a count above 32 can be requested from Microsoft)
• Consumer group count: up to 20 per Event Hub
• Max 5 concurrent readers on a partition per consumer group (one active receiver on a partition per consumer group is recommended)
Event Hub Scalability
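The ingress limits per TU translate into a simple sizing rule: take whichever of the two limits (MB/sec or events/sec) demands more units. A small sketch, using the figures from this slide and assuming 1 MB = 1000 KB; the function name is hypothetical.

```python
import math

INGRESS_MB_PER_TU = 1        # 1 MB/sec ingress per TU (from the slide)
INGRESS_EVENTS_PER_TU = 1000 # 1000 events/sec ingress per TU (from the slide)
MAX_STANDARD_TUS = 100       # Standard tier maximum quoted on the slide

def required_throughput_units(events_per_sec, avg_event_kb):
    """Return the number of TUs needed to ingest the given load."""
    ingress_mb_per_sec = events_per_sec * avg_event_kb / 1000
    by_bytes = math.ceil(ingress_mb_per_sec / INGRESS_MB_PER_TU)
    by_events = math.ceil(events_per_sec / INGRESS_EVENTS_PER_TU)
    return max(1, by_bytes, by_events)

for load in [(1000, 1), (100_000, 1), (5000, 0.1), (500, 4)]:
    tus = required_throughput_units(*load)
    print(load, tus, "fits Standard" if tus <= MAX_STANDARD_TUS else "needs Dedicated")
```

Small events are limited by the events/sec quota and large events by the MB/sec quota, which is why both terms are needed.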
20. 20
• Single tenant hosting with no noise from other tenants, available to
customers with an enterprise agreement
• Repeatable performance every time
• No additional charge for incoming messages
• Message size increases to 1 MB as compared to 256 KB for Standard and Basic
• Scalable between 1 and 8 capacity units (CU) – providing up to 2 million
ingress events per second
• CUs manage the scale for Event Hubs Dedicated, 1 CU = ~200 TU, max 8 CU
• Zero maintenance: management of load balancing, OS updates, security
patches, and partitioning
• Fixed pricing, billed monthly: ~720$ per day for 1 CU (pricing and CU size will change starting October 2017: ¼ CU for 5000$ per month)
Dedicated Event Hub
21. 21
• Max 10 Event Hubs per Namespace
• Throughput limit per partition is 1 TU
• Number of AMQP connections per namespace: 5000
• Azure-only deployment; no Azure Stack support yet
• No SAS on consumer group level, no built-in encryption or compression of
event body
• No functionality to drain Event Hub (need to create custom drainer)
• No local emulator
Other Event Hub constraints not covered earlier
22. 22
!!! COSTS ARE FOR INCOMING TRAFFIC ONLY, WITHOUT STORAGE COSTS
1,000 msg/sec, 1 KB in size, 1 MB/sec, 24/7 = 1 TU, Standard pricing tier:
1 * 22.32$ + (1,000 * 60 sec * 60 min * 24 hrs * 31 d) / 1,000,000 * 0.028$ = 22.32$ + 75$ ≈ 97.3$ per month - GOOD
100,000 msg/sec, 1 KB in size, 100 MB/sec, 24/7 = 100 TU (MAX), Standard pricing tier:
100 * 22.32$ + (100,000 * 60 sec * 60 min * 24 hrs * 31 d) / 1,000,000 * 0.028$ = 2,232$ + 7,500$ ≈ 9,730$ per month - GOOD
1,000,000 msg/sec, 1 KB in size, 1 GB/sec, 24/7 = 4 CU, Dedicated pricing tier:
4 * 720$ * 31 d = 89,280$ per month - TOO MUCH!
Event Hub pricing for ingestion
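The three calculations can be reproduced with a short script. The prices are the figures quoted on this slide (22.32$ per TU per 31-day month, 0.028$ per million ingress events, 720$ per CU per day), not current list prices; the function names are hypothetical.

```python
TU_PRICE_PER_MONTH = 22.32      # $ per TU for a 31-day month (from the slide)
PRICE_PER_MILLION_EVENTS = 0.028  # $ per million ingress events (from the slide)
CU_PRICE_PER_DAY = 720          # $ per CU per day (from the slide)

def standard_monthly_cost(throughput_units, msgs_per_sec, days=31):
    """Standard tier: TU charge plus a charge per million ingress events."""
    messages = msgs_per_sec * 60 * 60 * 24 * days
    ingress_cost = messages / 1_000_000 * PRICE_PER_MILLION_EVENTS
    return throughput_units * TU_PRICE_PER_MONTH + ingress_cost

def dedicated_monthly_cost(capacity_units, days=31):
    """Dedicated tier: fixed price per CU, no charge for incoming messages."""
    return capacity_units * CU_PRICE_PER_DAY * days

print(round(standard_monthly_cost(1, 1_000), 2))
print(round(standard_monthly_cost(100, 100_000), 2))
print(dedicated_monthly_cost(4))
```

The exact results differ from the slide's totals only by rounding of the ingress charge.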
23. 23
• Azure Event Hub can handle medium-load scenarios (100,000 msg/sec or 100 MB/sec) in a cost-effective manner and provides good feature parity
• For high-load scenarios (1,000,000+ msg/sec or 1+ GB/sec) or big-data scenarios it seems too expensive (an Apache Kafka cluster is cheaper but requires investment in tuning and maintenance)
• Always consider the quality attribute requirements for your system before moving forward with technology decisions. PaaS is not always the right choice for high-load scenarios
Summary