Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with Kafka and Cassandra clusters” AI&BigDataDay 2017

www.eleks.com
Azure Real-Time Analytics And Kappa Architecture
with Kafka and Cassandra clusters
Vitalii Bondarenko
vitaliy.bondarenko@eleks.com

Agenda
• Streaming analytics in Azure
• Apache Cassandra
• Apache Kafka
• Confluent Platform and Kafka Streams
• Examples

Big Data Approach
RDBMS Approach
• Massive Parallel Processing (Scalability)
• In-memory DB (Streaming and compressing)
• Column stores (BI)
Big Data Approach
• Hadoop (HDFS + MapReduce)
• SQL on HDFS
• Scalable NoSQL
• Batch issue

Lambda architecture
Batch & Stream Processing
• Batch layer
  • Stores master dataset
  • Computes arbitrary views
  • Horizontally scalable
• Speed layer (Streaming)
  • Fast, incremental algorithms
  • Batch layer eventually overrides speed layer
• Serving layer
  • Random access to batch views
  • Updated by batch and streaming layers
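The three layers can be sketched in a few lines. This is only an illustration, assuming the view is a simple per-key event count; the function names and the sample events are hypothetical and not from the talk.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute the view from the full master dataset."""
    return Counter(event["key"] for event in master_dataset)

def update_speed_view(speed_view, event):
    """Speed layer: incremental update for events the batch run has not seen yet."""
    speed_view[event["key"]] += 1

def query(batch_view, speed_view, key):
    """Serving layer: combine the batch view with the recent increments."""
    return batch_view[key] + speed_view[key]

master = [{"key": "a"}, {"key": "b"}, {"key": "a"}]
batch_view = build_batch_view(master)
speed_view = Counter()
update_speed_view(speed_view, {"key": "a"})  # arrives after the batch run
total_before = query(batch_view, speed_view, "a")

# The next batch run absorbs the new event and the speed view is discarded.
master.append({"key": "a"})
batch_view = build_batch_view(master)
speed_view = Counter()
total_after = query(batch_view, speed_view, "a")
```

Both queries return the same total, which is the sense in which the batch layer eventually overrides the speed layer.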

Kappa architecture
Stream Processing with Scalable Storages
• Everything is a stream
• Immutable unstructured data sources
• Single analytics framework
• Windows on Streaming Layer
• Linearly scalable Serving Layer
• Interactive querying
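A minimal sketch of the "everything is a stream" idea, assuming an in-memory list stands in for the immutable log. The `replay` helper and both processing functions are hypothetical names used for illustration.

```python
def replay(log, process):
    """Feed every record of the immutable log through one processing function."""
    state = {}
    for record in log:
        process(state, record)
    return state

def count_by_device(state, record):
    device = record["device"]
    state[device] = state.get(device, 0) + 1

def max_by_device(state, record):
    device = record["device"]
    state[device] = max(state.get(device, record["value"]), record["value"])

log = [
    {"device": "d1", "value": 10},
    {"device": "d2", "value": 7},
    {"device": "d1", "value": 12},
]

counts = replay(log, count_by_device)
# Changing the logic means replaying the same log, not running a separate batch job.
maxima = replay(log, max_by_device)
```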

Azure Streaming Analytics
• Easy to use
• Scalable
• Connectivity
• SQL, UDF, Reference Data

Streaming processing
Windows
Stateful
Stateless
Fault Tolerance
Scalability
Low Latency
SELECT
    Make,
    System.TimeStamp AS Time,
    COUNT(*) AS [Count]
INTO AlertOutput
FROM Input TIMESTAMP BY Time
GROUP BY
    Make,
    TumblingWindow(second, 10)
HAVING [Count] >= 3
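The same tumbling-window logic can be sketched outside the service. This is an illustration only: it assumes integer timestamps in seconds, simplifies window boundary handling to `[start, end)`, and uses made-up sample events.

```python
from collections import Counter

def tumbling_window_alerts(events, window_seconds=10, threshold=3):
    """Count events per (window, make) and keep groups with count >= threshold."""
    counts = Counter()
    for ts, make in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, make)] += 1
    # Report the window end as the result timestamp of each group.
    return sorted(
        (start + window_seconds, make, n)
        for (start, make), n in counts.items()
        if n >= threshold
    )

# (timestamp in seconds, make): illustrative data
events = [(1, "Toyota"), (3, "Toyota"), (4, "Ford"), (9, "Toyota"),
          (11, "Ford"), (12, "Ford"), (15, "Toyota")]
alerts = tumbling_window_alerts(events)
```

Windows never overlap, so each event is counted exactly once.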

Demo: Azure Streaming Analytics
• Data from Event Hub
• Geo-Analytics on Streaming
• Visualization on PowerBI
• Demo for streaming analytics projects

Fast Data Platform
• Real-time processing
• Raw Data fast writing
• Scalable
• Distributed

Demo: Azure Streaming Analytics
• Demo for streaming analytics projects
• Platform deployment

Apache Cassandra
• Multi-master, low-latency, shared nothing
• Distributed
• No single point of failure
• Linearly Scalable
• Multi-datacenter configuration
• AP with tunable consistency

Nodes and distributions
• Distributed by Tokens from -2^63 to 2^63-1
• Hash from partition key. Murmur3
• Virtual Nodes
• Data Centers and Racks, Gossip (each 1 sec)
• Replication Strategy (SimpleStrategy, NetworkTopologyStrategy)
• Replication factor (usually 3), Gossip and Coordinators
• Tunable consistency, strong and eventual
• Consistency Levels (One, Two, Three, Any, All, Quorum, Local_Quorum, Local_One…)
• (R + W) > N gives strong consistency (R, W: replicas contacted on read and write; N: replication factor)
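Token-based placement and the (R + W) > N rule can be sketched as follows. This is a simplification under stated assumptions: MD5 stands in for Murmur3 (which is not in the Python standard library), there are no virtual nodes, replicas are simply the next nodes clockwise on the ring, and the node names are hypothetical.

```python
import hashlib
from bisect import bisect_left

def token(partition_key):
    """Stand-in for Murmur3: map a key to a signed 64-bit token."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def replicas(ring, partition_key, replication_factor):
    """Walk clockwise from the key's token and take the next distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(partition_key)) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners

def is_strongly_consistent(r, w, n):
    """A read overlaps the latest write when R + W > N."""
    return r + w > n

# Four hypothetical nodes with evenly spaced tokens in [-2^63, 2^63 - 1]
ring = sorted((-2**63 + i * 2**62, f"node{i}") for i in range(4))
owners = replicas(ring, "sensor-42", 3)
```

With N = 3, Quorum reads and writes (R = W = 2) satisfy the rule, while One/One does not.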

Cassandra Objects
• Column, which is a name/value pair
• Row, which is a container for columns referenced by a primary key
• Table, which is a container for rows
• Keyspace, which is a container for tables
• Cluster, which is a container for keyspaces that spans one or more nodes
• Tombstones for deleted rows
• TTL for deleting rows
• Compaction for merging SSTables
• Secondary Indexes for filtering
• CQL (Cassandra Query Language)

CQL (Cassandra Query Language)
• Similar to SQL
• No joins; counters and static columns are supported
• Keyspaces with replication factor
• SET, LIST, MAP, Tuples
• TTL: INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /* 24 hours */
• Ordering and filtering are restricted (always query by partition key)
CREATE TABLE loads (
    machine inet,
    cpu int,
    mtime timeuuid,
    load float,
    PRIMARY KEY ((machine, cpu), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);
/* Select data within a range */
SELECT * FROM myTable WHERE myField > 5000
    AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.
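Why the partition key matters can be seen in a toy model of the `loads` table. This sketch assumes integer timestamps instead of `timeuuid` values and a plain dictionary instead of a cluster; the class and method names are hypothetical.

```python
from collections import defaultdict

class LoadsTable:
    """Toy model of PRIMARY KEY ((machine, cpu), mtime) with mtime DESC."""

    def __init__(self):
        self.partitions = defaultdict(list)  # (machine, cpu) -> [(mtime, load)]

    def insert(self, machine, cpu, mtime, load):
        rows = self.partitions[(machine, cpu)]
        rows.append((mtime, load))
        rows.sort(reverse=True)  # keep clustering order: newest first

    def select_range(self, machine, cpu, mtime_from, mtime_to):
        """Cheap: touches one partition and slices on the clustering key."""
        rows = self.partitions.get((machine, cpu), [])
        return [(t, load) for t, load in rows if mtime_from < t < mtime_to]

    def select_filtering(self, min_load):
        """What ALLOW FILTERING implies: every partition is scanned."""
        return [(key, t, load)
                for key, rows in self.partitions.items()
                for t, load in rows if load > min_load]

table = LoadsTable()
table.insert("10.0.0.1", 0, 100, 0.5)
table.insert("10.0.0.1", 0, 200, 0.9)
table.insert("10.0.0.2", 0, 150, 0.7)

recent = table.select_range("10.0.0.1", 0, 50, 250)
hot = table.select_filtering(0.6)
```

The range query reads a single partition, while the filtering query cost grows with the total number of partitions.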

Data Modeling: Query-First Design
RDBMS: Data > Models > Application
Cassandra: Application > Models > Data
RDBMS
Device
User
Location
Values (Timestamp, Values)
Cassandra (no joins)
raw_values(Timestamp, Device, User, Location, Values)
day_values
hour_location_values
hour_location_device_values
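Query-first design usually means writing each measurement to several tables, one per query. The sketch below is an assumption-heavy illustration: the slide does not give the keys of `day_values` or `hour_location_values`, so the keys chosen here (date, and hour plus location) are guesses, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime

raw_values = []
day_values = defaultdict(list)            # assumed key: date
hour_location_values = defaultdict(list)  # assumed key: (hour, location)

def write_measurement(ts, device, user, location, value):
    """Write one measurement to every table that serves a known query."""
    row = (ts, device, user, location, value)
    raw_values.append(row)
    day_values[ts.date()].append(row)
    hour = ts.replace(minute=0, second=0, microsecond=0)
    hour_location_values[(hour, location)].append(row)

# Illustrative sample data
write_measurement(datetime(2017, 10, 14, 9, 15), "dev-1", "alice", "Lviv", 21.5)
write_measurement(datetime(2017, 10, 14, 9, 40), "dev-2", "bob", "Lviv", 22.0)
write_measurement(datetime(2017, 10, 14, 10, 5), "dev-1", "alice", "Kyiv", 19.0)

lviv_9am = hour_location_values[(datetime(2017, 10, 14, 9), "Lviv")]
```

Data is duplicated on write so that each read hits exactly one partition and needs no join.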

Demo: Apache Cassandra in use
• Demo for Apache Cassandra

Apache Kafka
• Created at LinkedIn, open-sourced in 2011, Apache top-level project since 2012
• Service Bus
• Small messages (events)
• Scalable Broker System
• Durable and Distributed
• Very fast (parallelism on partitions)
• No removes from queue, retention
• Streaming processing capability
LinkedIn:
• 1400 brokers
• 13M+ messages/sec
• 2.75GB per second

Brokers, Topics
• Distributed Service Bus
• Broker as virtual servers
• Topics as logical data storage

Writes and reads
• Append Only
• Commit log
• Consumer offset (from beginning)
• Read positions are committed to the Kafka topic __consumer_offsets
• Commit after the data has been read
• Retention period: 7 days (default)
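The append-only log, consumer offsets and retention can be modelled in a few lines. This is a toy single-partition sketch, not the Kafka client API; the class and method names are hypothetical.

```python
class TopicLog:
    """Toy single-partition commit log with consumer offsets and retention."""

    def __init__(self, retention_seconds=7 * 24 * 3600):
        self.records = []      # (offset, timestamp, payload)
        self.next_offset = 0
        self.committed = {}    # group id -> next offset to read
        self.retention_seconds = retention_seconds

    def append(self, timestamp, payload):
        self.records.append((self.next_offset, timestamp, payload))
        self.next_offset += 1
        return self.next_offset - 1

    def poll(self, group_id, max_records=10):
        """Read from the committed offset; reading does not remove records."""
        start = self.committed.get(group_id, 0)
        return [r for r in self.records if r[0] >= start][:max_records]

    def commit(self, group_id, offset):
        self.committed[group_id] = offset + 1

    def expire(self, now):
        """Retention drops old records whether or not they were read."""
        self.records = [r for r in self.records
                        if now - r[1] < self.retention_seconds]

log = TopicLog(retention_seconds=100)
for ts, payload in [(0, "a"), (10, "b"), (20, "c")]:
    log.append(ts, payload)

batch = log.poll("g1", max_records=2)
log.commit("g1", batch[-1][0])
rest = log.poll("g1")
other_group = log.poll("g2")   # a new group starts from the beginning
log.expire(now=105)
```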

Partitions, Replication, Zookeeper
• Workers, tasks, statuses
• Find leaders for a partition
• Distribute tasks
• Monitor results
• Configuration
• Health statuses
• Group memberships (elections)
• Durable and Scalable
Message (timestamp, id, Payload (binary))
Replication Factor
Partition for each broker
Partition = commit log
Replica Leader elected and saved to Zookeeper
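Partitioning and replica leadership can be sketched as below. The assumptions are significant: CRC32 stands in for the hash used by the real producer, placement is plain round-robin, and the first surviving replica becomes leader. Broker names and function names are hypothetical.

```python
import zlib

def partition_for(key, num_partitions):
    """Keyed messages always land in the same partition (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode()) % num_partitions

def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin placement; the first replica in each list acts as leader."""
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }

def elect_leader(replica_list, alive):
    """On broker failure, promote the first surviving replica."""
    return next((b for b in replica_list if b in alive), None)

brokers = ["broker-0", "broker-1", "broker-2"]
layout = assign_replicas(6, brokers, 3)
new_leader = elect_leader(layout[0], alive={"broker-1", "broker-2"})
```

Because each partition is an independent commit log, ordering is preserved per partition rather than per topic.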

Consumer Groups
• Continuous polling of the brokers for a topic
• Consumers share a group.id for parallelism and scalability
• Consumers registered in ZooKeeper
• Coordinators assign partitions to consumers
• Coordinator rebalances partitions between consumers
• Number of consumers in a group should match the number of partitions (extra consumers stay idle)
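How a coordinator might spread partitions across a group, and why extra consumers stay idle, can be shown with a simple round-robin sketch. The `assign` function and consumer names are hypothetical; real assignment strategies are more involved.

```python
def assign(partitions, consumers):
    """Spread partitions over the group's consumers round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))

three = assign(partitions, ["c1", "c2", "c3"])

# A consumer leaves the group: the remaining members take over its partitions.
after_failure = assign(partitions, ["c1", "c3"])

# More consumers than partitions: the surplus gets nothing to read.
eight = assign(partitions, [f"c{i}" for i in range(8)])
idle = [c for c, owned in eight.items() if not owned]
```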

Demo: Apache Kafka in use
• Apache Kafka CLI
• Fault Tolerance

Confluent Platform
• Schema registry
• Kafka Connect
• REST Proxy
• Kafka Streams
• Confluent is a major contributor to Kafka
• Streaming platform
• Main components are open source

Kafka Streams
• Easy to deploy and maintain
• Integrated with Confluent Platform
• Process topology
• Micro-services
• State store

KStream and KTable
• Execution topology for process
• Stateful, no need to go to the DB for every event
• KStreams and KTables
• KTable is a local, distributed database
• Data Locality, No network roundtrips
• Elastic and Scalable
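The difference between the two abstractions can be sketched conceptually. Kafka Streams itself is a Java library, so this Python snippet is not its API; it only assumes that a stream is a sequence of (key, value) records and that a dictionary plays the role of the local state store.

```python
def to_table(stream):
    """KTable view: each record is an upsert, so only the latest value per key remains."""
    table = {}
    for key, value in stream:
        table[key] = value
    return table

def count_by_key(stream):
    """Stateful aggregation kept in a local state store (a dict here)."""
    store = {}
    changelog = []
    for key, _ in stream:
        store[key] = store.get(key, 0) + 1
        changelog.append((key, store[key]))  # every update is emitted downstream
    return store, changelog

# KStream view: every record is an independent event (illustrative data)
stream = [("alice", "Lviv"), ("bob", "Kyiv"), ("alice", "Odesa")]

latest = to_table(stream)
counts, changelog = count_by_key(stream)
```

The counts live next to the processing code, which is why no database round trip is needed per event.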

Demo: Streaming App Examples
• Cassandra Schema
• Connectors
• Kafka Streams
• DS/OS

Cluster and results
Kafka Cluster
Nodes: 9
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 100 GB
Topics: 3
Partitions: 6 per topic
Replication Factor: 3
Producers: 6
Average message size: 1 KB
Throughput: 440,000 messages/second
Cassandra Cluster
Nodes: 12
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 800 GB
Replication Factor: 3
Average write latency: 9 ms
Average read latency: 52 ms
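A rough back-of-the-envelope reading of the Kafka figures, under assumptions not stated on the slide: 1 KB is taken as 1000 bytes, the throughput is producer ingest before replication, and load is spread evenly across brokers.

```python
messages_per_second = 440_000
message_bytes = 1_000        # assumption: "1 KB" taken as 1000 bytes
replication_factor = 3
kafka_nodes = 9

# Producer ingest rate in MB/s
ingest_mb_s = messages_per_second * message_bytes / 1e6

# Every message is written once per replica
replicated_mb_s = ingest_mb_s * replication_factor

# Average write load per broker, assuming an even spread
per_node_mb_s = replicated_mb_s / kafka_nodes

print(f"ingest: {ingest_mb_s:.0f} MB/s, "
      f"with replication: {replicated_mb_s:.0f} MB/s, "
      f"per broker: {per_node_mb_s:.1f} MB/s")
```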

Lessons Learned
• Amazing Capabilities, more than 1M/sec
• New mindset of streaming processing
• Time is important
• SQL Interface for streaming is not ready
• Difficulties in management and scalability
• Difficult to debug
• Lack of documentation and community support
• Design your DB carefully at the beginning, around the queries
• Cassandra is not RDBMS, select by partition keys
• Eventual consistency
• Very Expensive! (lots of nodes)

Q&A
Vitalii Bondarenko
vitaliy.bondarenko@eleks.com
