5. Kappa architecture
Stream Processing with Scalable Storages
• Everything is a stream
• Immutable unstructured data sources
• Single analytics framework
• Windows on Streaming Layer
• Linearly scalable Serving Layer
• Interactive querying
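A minimal sketch of the "windows on the streaming layer" idea, assuming fixed (tumbling) windows and an immutable event log. The function name, the event shape and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Align each event to the start of the window that contains it.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical event log: (epoch seconds, device id). Events are never updated in place.
events = [(0, "a"), (5, "a"), (12, "b"), (19, "a"), (20, "a")]
window_counts = tumbling_window_counts(events, window_seconds=10)
```

Because the log is immutable, the same function can be replayed over the full history to rebuild results, which is the core of the single-framework approach.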
8. Demo: Azure Streaming Analytics
• Data from Event Hub
• Geo-Analytics on Streaming
• Visualization on PowerBI
• Demo for streaming analytics projects
9. Fast Data Platform
• Real-time processing
• Raw Data fast writing
• Scalable
• Distributed
11. Apache Cassandra
• Multi-master, low-latency, shared nothing
• Distributed
• No single point of failure
• Linearly Scalable
• Multi-datacenter configuration
• AP with tunable consistency
12. Nodes and distributions
• Distributed by Tokens from -2^63 to 2^63-1
• Hash from partition key. Murmur3
• Virtual Nodes
• Data Centers and Racks, Gossip (every second)
• Replication Strategy (SimpleStrategy, NetworkTopologyStrategy)
• Replication factor (usually 3), Gossip and Coordinators
• Tunable consistency, strong and eventual
• Consistency Levels (One, Two, Three, Any, All, Quorum,
Local_Quorum, Local_One…)
• Strong consistency when R + W > N (R = replicas read, W = replicas written, N = replication factor)
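The R + W > N rule can be checked with simple arithmetic. The sketch below is illustrative only; the helper names are hypothetical and it does not talk to a cluster.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 of 3."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """Reads overlap the latest write on at least one replica when R + W > N."""
    return read_replicas + write_replicas > replication_factor


N = 3                                                  # replication factor from the slide
q = quorum(N)                                          # QUORUM on 3 replicas
quorum_both = is_strongly_consistent(q, q, N)          # QUORUM reads + QUORUM writes
one_one = is_strongly_consistent(1, 1, N)              # ONE reads + ONE writes
write_all_read_one = is_strongly_consistent(1, N, N)   # ONE reads + ALL writes
```

With N = 3, QUORUM/QUORUM and ALL/ONE satisfy the rule, while ONE/ONE gives only eventual consistency.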
13. Cassandra Objects
Column, which is a name/value pair
Row, which is a container for columns referenced by
a primary key
Table, which is a container for rows
Keyspace, which is a container for tables
Cluster, which is a container for keyspaces that
spans one or more nodes
Tombstones for deleted rows
TTL for deleting rows
Compaction for merging SSTables
Secondary Indexes for filtering
CQL (Cassandra Query Language)
14. CQL (Cassandra Query Language)
• Similar to SQL
• No joins; counters and static columns are supported
• Keyspaces with replication factor
• SET, LIST, MAP, Tuples
• TTL
INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /* 24 hours */
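• Ordering and filtering are restricted: ordering follows the clustering columns inside a partition, and filtering on other columns needs ALLOW FILTERING (always query by partition key)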
CREATE TABLE loads (
machine inet,
cpu int,
mtime timeuuid,
load float,
PRIMARY KEY ((machine, cpu), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);
/* Select Data within a range */
SELECT * FROM myTable WHERE myField > 5000
AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.
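Why the partition key matters can be shown with a toy in-memory model of the loads table. This is a sketch, not a Cassandra client: the class and method names are hypothetical, and mtime is a plain integer instead of a timeuuid to keep the example short.

```python
from collections import defaultdict


class ToyLoadsTable:
    """In-memory stand-in for the loads table; not a Cassandra client."""

    def __init__(self):
        # Partition key (machine, cpu) -> rows kept sorted by mtime, newest first.
        self.partitions = defaultdict(list)

    def insert(self, machine, cpu, mtime, load):
        rows = self.partitions[(machine, cpu)]
        rows.append((mtime, load))
        rows.sort(key=lambda row: row[0], reverse=True)  # mimics CLUSTERING ORDER BY (mtime DESC)

    def select_by_partition(self, machine, cpu, mtime_min, mtime_max):
        """Range query on the clustering column inside one partition."""
        rows = self.partitions.get((machine, cpu), [])
        return [row for row in rows if mtime_min <= row[0] <= mtime_max]

    def select_with_filtering(self, min_load):
        """No partition key given: every partition has to be scanned."""
        scanned = 0
        matches = []
        for (machine, cpu), rows in self.partitions.items():
            scanned += 1
            matches.extend((machine, cpu, mtime, load)
                           for mtime, load in rows if load > min_load)
        return matches, scanned


table = ToyLoadsTable()
table.insert("10.0.0.1", 0, mtime=100, load=0.4)
table.insert("10.0.0.1", 0, mtime=200, load=0.9)
table.insert("10.0.0.2", 1, mtime=150, load=0.7)

recent = table.select_by_partition("10.0.0.1", 0, 50, 250)
hot, partitions_scanned = table.select_with_filtering(min_load=0.6)
```

The first query reads one partition and returns rows already in clustering order; the second touches every partition, which is the cost the "Bad Request" message warns about.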
15. Data Modeling: Query-First Design
RDBMS: Data > Models > Application
Cassandra: Application > Models > Data
RDBMS
• Device
• User
• Location
• Values (Timestamp, Values)
Cassandra (no joins)
• raw_values(Timestamp, Device, User, Location, Values)
• day_values
• hour_location_values
• hour_location_device_values
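Query-first design usually means writing each measurement to several denormalized tables, one per query. The sketch below uses the table names from the slide; the partition keys and stored columns are assumptions made for illustration, and plain dictionaries stand in for Cassandra tables.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Dictionaries stand in for tables; each key plays the role of an assumed partition key.
raw_values = []
day_values = defaultdict(list)
hour_location_values = defaultdict(list)
hour_location_device_values = defaultdict(list)


def write_measurement(ts, device, user, location, value):
    """Fan one measurement out to every query-specific table at write time."""
    day = ts.strftime("%Y-%m-%d")
    hour = ts.strftime("%Y-%m-%d %H")
    raw_values.append((ts, device, user, location, value))
    day_values[day].append((ts, device, value))
    hour_location_values[(hour, location)].append((ts, device, value))
    hour_location_device_values[(hour, location, device)].append((ts, value))


write_measurement(datetime(2017, 1, 1, 10, 15, tzinfo=timezone.utc), "dev-1", "alice", "berlin", 21.5)
write_measurement(datetime(2017, 1, 1, 10, 45, tzinfo=timezone.utc), "dev-2", "bob", "berlin", 22.0)

# "All values for Berlin between 10:00 and 11:00" is a single-partition lookup, no join needed.
berlin_10h = hour_location_values[("2017-01-01 10", "berlin")]
```

The cost is write amplification and duplicated data; the benefit is that every read hits exactly one partition.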
17. Apache Kafka
• Created at LinkedIn, open-sourced in 2011, Apache top-level project since 2012
• Service Bus
• Small messages (events)
• Scalable Broker System
• Durable and Distributed
• Very fast (parallelism on partitions)
• Messages are not removed when consumed; they expire by retention policy
• Streaming processing capability
LinkedIn:
• 1400 brokers
• 13M+ messages/sec
• 2.75GB per second
19. Writes and reads
• Append Only
• Commit log
• Consumer offset (reading can start from the beginning)
• Read offsets are committed to the internal Kafka topic __consumer_offsets
• Commit once the data has been read
• Default retention period: 7 days
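The append-only log and per-group offsets can be sketched with a toy partition log. This is not the Kafka client API; the class and its methods are hypothetical and exist only to show that reading does not remove records and that each consumer group tracks its own position.

```python
class ToyPartitionLog:
    """Toy model of one partition: an append-only list plus committed offsets."""

    def __init__(self):
        self._records = []
        self._committed = {}  # group id -> next offset to read

    def append(self, payload):
        """Append a record and return its offset."""
        self._records.append(payload)
        return len(self._records) - 1

    def poll(self, group_id, max_records=10):
        """Return (start_offset, records) from the group's committed position."""
        start = self._committed.get(group_id, 0)  # new groups start from the beginning
        return start, self._records[start:start + max_records]

    def commit(self, group_id, next_offset):
        """Store the next offset the group should read."""
        self._committed[group_id] = next_offset


log = ToyPartitionLog()
for payload in ("a", "b", "c"):
    log.append(payload)

first_start, first_batch = log.poll("group-1")    # reads everything so far
log.commit("group-1", 2)                          # pretend only "a" and "b" were processed
second_start, second_batch = log.poll("group-1")  # resumes at offset 2
other_start, other_batch = log.poll("group-2")    # independent group, starts at 0
```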
20. Partitions, Replication, Zookeeper
• Workers, tasks, statuses
• Find leaders for a partition
• Distribute tasks
• Monitor results
• Configuration
• Health statuses
• Group memberships (elections)
• Durable and Scalable
Message (timestamp, id, Payload (binary))
Replication Factor
Partition for each broker
Partition = commit log
Replica Leader elected and saved to Zookeeper
21. Consumer Groups
• Continuous polling of the brokers for a topic
• Consumer group.id for parallelism and scalability
• Consumers registered in Zookeeper
• Coordinators assign partitions to consumers
• Coordinator rebalances partitions between consumers
• Number of consumers in a group should not exceed the number of partitions (extra consumers stay idle)
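The relation between consumers and partitions can be illustrated with a simplified round-robin assignment. This is a sketch under stated assumptions, not Kafka's actual assignor implementation; the function and consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions over the consumers of one group, one at a time."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


partitions = list(range(6))  # 6 partitions per topic, as in the cluster setup

three = assign_round_robin(partitions, ["c1", "c2", "c3"])
eight = assign_round_robin(partitions, [f"c{i}" for i in range(1, 9)])
idle = [consumer for consumer, owned in eight.items() if not owned]
```

With three consumers each one owns two partitions; with eight consumers two of them receive nothing, so adding consumers beyond the partition count brings no extra parallelism.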
23. Confluent Platform
• Schema registry
• Kafka Connect
• REST Proxy
• Kafka Streams
• Confluent is the contributor
• Streaming platform
• Main components are open source
24. Kafka Streams
• Easy to deploy and maintain
• Integrated with Confluent Platform
• Process topology
• Micro-services
• State store
25. KStream and KTable
• Execution topology for process
• Stateful, no need to go to the DB for every event
• KStream and KTables
• KTable is a local, distributed database
• Data Locality, No network roundtrips
• Elastic and Scalable
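The difference between a KStream (every record is an independent event) and a KTable (the latest value per key) can be sketched in a few lines. This is plain Python, not the Kafka Streams API, which is a Java library; the function names and sample data are hypothetical.

```python
def changelog_to_table(changelog):
    """Fold a stream of (key, value) updates into the latest value per key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # a None value acts as a delete marker
        else:
            table[key] = value
    return table


def enrich(events, table):
    """Join each event with local table state; no database round trip per event."""
    return [(key, payload, table.get(key)) for key, payload in events]


# Hypothetical changelog of device locations and a stream of readings.
locations = changelog_to_table([
    ("dev-1", "berlin"),
    ("dev-2", "paris"),
    ("dev-1", "munich"),  # later update wins
    ("dev-2", None),      # dev-2 removed
])
enriched = enrich([("dev-1", 21.5), ("dev-2", 19.0)], locations)
```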
27. Cluster and results
Kafka Cluster
Nodes: 9
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 100 GB
Topics: 3
Partitions: 6 per topic
Replication Factor: 3
Producers: 6
Average message size: 1 KB
440,000 messages/second
Cassandra Cluster
Nodes: 12
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 800 GB
Replication Factor: 3
Average write latency: 9 ms
Average read latency: 52 ms
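A rough back-of-the-envelope reading of the Kafka figures above. The arithmetic assumes 1 KB = 1000 bytes, that load is spread evenly over brokers and partitions, and that every message is stored on all three replicas; none of these assumptions is stated in the slides.

```python
messages_per_second = 440_000
message_size_bytes = 1_000     # assumption: 1 KB = 1000 bytes
brokers = 9
replication_factor = 3
topics = 3
partitions_per_topic = 6

# Producer-side ingress in MB/s.
ingress_mb_per_s = messages_per_second * message_size_bytes / 1e6

# Bytes written cluster-wide once every replica has its copy.
replicated_mb_per_s = ingress_mb_per_s * replication_factor

# Even spread over brokers and partitions (assumption).
per_broker_mb_per_s = replicated_mb_per_s / brokers
total_partitions = topics * partitions_per_topic
per_partition_msgs_per_s = messages_per_second / total_partitions
```

Under these assumptions the cluster ingests about 440 MB/s, writes about 1.3 GB/s including replicas, and each broker handles roughly 147 MB/s of writes.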
28. Lessons Learned
• Amazing Capabilities, more than 1M/sec
• New mindset of streaming processing
• Time is important
• SQL Interface for streaming is not ready
• Difficulties in management and scalability
• Difficult to debug
• Lack of documentation and community support
• Design your DB carefully at the beginning, around the queries
• Cassandra is not RDBMS, select by partition keys
• Eventual consistency
• Very Expensive! (lots of nodes)