REAL-TIME ANALYTICS
WITH KAFKA, CASSANDRA
& STORM
Dr. John Georgiadis
Modio Computing
USE CASES
• Collecting/processing measurements from large
sensor networks (e.g. weather data).
• Aggregated processing of financial trading streams.
• Customer activity monitoring for advertising
purposes, fraud detection, etc.
• Real-time security log processing.
SOLUTION APPROACH
• Real-time Updates: Employ streaming instead of batch analytics.
• Apache Storm: Large installation base. Streaming & micro-batch.
• Apache Spark: Uniform API for batch & micro-batch. On top of
YARN/HDFS. Micro-batch less mature but catching up quickly.
• Large data sets + Time Series + Write-Intensive + Data Expiration =
Apache Cassandra
ARCHITECTURE
APACHE KAFKA
• N Nodes
• T Topics
• Replication Factor: Defines high availability
• Partitions: They define the level of parallelism. A single consumer per
partition (within a consumer group).
• Consumer discovers cluster nodes through Zookeeper
• Consumer partition state is just an integer: the partition offset.
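A minimal sketch of the last two points, assuming an in-memory stand-in for a topic. The `Partition` and `Consumer` classes and the `assign` helper are hypothetical and are not the Kafka client API. Each partition is an append-only log, and the only state a consumer keeps per partition is the integer offset of the next message to read.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Append-only log standing in for one Kafka partition."""
    log: list = field(default_factory=list)

    def append(self, message):
        self.log.append(message)


@dataclass
class Consumer:
    """Consumer of a single partition; its whole state is one integer."""
    partition: Partition
    offset: int = 0  # index of the next message to read

    def poll(self, max_messages):
        batch = self.partition.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch


def assign(partitions, consumer_count):
    """Round-robin partitions over consumers: each partition gets exactly one owner."""
    owners = [[] for _ in range(consumer_count)]
    for index, partition in enumerate(partitions):
        owners[index % consumer_count].append(partition)
    return owners


partition = Partition()
for value in ["a", "b", "c", "d", "e"]:
    partition.append(value)

consumer = Consumer(partition)
first = consumer.poll(2)    # reads offsets 0 and 1
second = consumer.poll(2)   # resumes from offset 2
# A restarted consumer only needs the saved offset to continue.
resumed = Consumer(partition, offset=consumer.offset).poll(10)
```

Because every partition has exactly one owner, adding consumers beyond the number of partitions leaves the extra consumers idle in this model.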
STORM
• Storm is a distributed computing platform. In Storm a distributed
computation is a directed graph of interconnected processors (topology)
that exchange messages.
• Spouts: Processors that inject messages into the topology.
• Bolts: Processors that process messages, including sending them to 3rd parties
(e.g. persistence).
• Trident: High-level operations on message batches. Supports batch replay
in case of failure. Translates to a graph of low-level spouts and bolts.
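To make the spout/bolt vocabulary concrete, here is a toy single-process model of a topology. It is only a sketch: `run_topology`, the bolt names and the sample sentences are invented for illustration, and none of this is the Storm API. A spout injects messages, and each bolt processes a message and emits zero or more messages to its downstream bolts along the edges of a directed graph.

```python
from collections import Counter, deque


def sentence_spout():
    """Spout: injects messages into the topology."""
    yield from ["storm kafka", "kafka cassandra", "storm"]


def split_bolt(sentence):
    """Bolt: emits one message per word."""
    return sentence.split()


class CountBolt:
    """Terminal bolt: keeps running counts (stand-in for persistence)."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, word):
        self.counts[word] += 1
        return []  # emits nothing downstream


def run_topology(spout, bolts, edges, first_bolt):
    """Push every spout message through the directed graph of bolts."""
    queue = deque((first_bolt, message) for message in spout)
    while queue:
        name, message = queue.popleft()
        for emitted in bolts[name](message):
            for downstream in edges.get(name, []):
                queue.append((downstream, emitted))


counter = CountBolt()
bolts = {"split": split_bolt, "count": counter}
edges = {"split": ["count"]}  # split -> count
run_topology(sentence_spout(), bolts, edges, first_bolt="split")
```

In a real cluster each processor would run as many parallel instances spread over several machines; the graph shape is the part this sketch is meant to show.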
STORM :: NIMBUS
• A single controller (Nimbus) where topologies are submitted. Nimbus breaks
topologies into tasks and forwards them to supervisors, which spawn one or more
workers (processes) per task.
• Nimbus redistributes tasks in case a supervisor fails.
• Nimbus is not HA. If Nimbus fails, running topologies are not affected.
STORM :: SUPERVISOR
• 1 supervisor per host.
• Supervisor registers with ZK at startup and thus it’s discoverable
by Nimbus.
• Supervisor spawns Worker JVMs: one process per topology.
• JAR submitted to Nimbus is copied to Worker classpath.
• When a Supervisor dies, all Worker tasks are migrated to the
remaining Supervisors.
STORM :: TRIDENT
• When to use Micro-batch (aka Trident) instead of Streaming:
• Millisecond latency not required. Typical Trident latency threshold: 500ms.
• Allows batch mode persistence operations.
• High-level abstractions: partitionBy, partitionAggregate, stateQuery,
partitionPersist.
• Batch processing timeout/exception will cause a replay of the batch,
provided the Spout supports replays (Kafka does): At-least-once semantics.
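The replay behaviour can be illustrated with a small simulation. This is a sketch under assumptions: `ReplayableSpout` and `process_all` are hypothetical names, the failure is injected artificially, and nothing here uses Trident itself. It shows why replaying a whole batch after a partial write gives at-least-once rather than exactly-once delivery.

```python
class ReplayableSpout:
    """Hands out batches by id and can re-emit any batch on request."""

    def __init__(self, messages, batch_size):
        self.batches = [messages[i:i + batch_size]
                        for i in range(0, len(messages), batch_size)]

    def emit(self, batch_id):
        return self.batches[batch_id]


def process_all(spout, sink, fail_once_on):
    """Process every batch; a failed batch is replayed until it succeeds."""
    failures_left = {fail_once_on: 1}
    attempts = 0
    for batch_id in range(len(spout.batches)):
        while True:
            attempts += 1
            batch = spout.emit(batch_id)
            try:
                for position, message in enumerate(batch):
                    sink.append(message)
                    # Simulated timeout after part of the batch was already written.
                    if failures_left.get(batch_id) and position == 0:
                        failures_left[batch_id] -= 1
                        raise TimeoutError(f"batch {batch_id} timed out")
                break
            except TimeoutError:
                continue  # replay the whole batch


    return attempts


sink = []
attempts = process_all(ReplayableSpout(["m1", "m2", "m3", "m4"], 2), sink, fail_once_on=1)
```

After the run, `m3` appears twice in `sink`: it was written before the simulated timeout and again during the replay. No message is lost, but downstream writes need to be idempotent (or deduplicated) to hide the duplicates.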
STORM :: PARALLELISM
• Parallelism = Number of threads executing a topology cluster-wide.
• Parallelism <= CPU threads/worker x Workers
• Define per-topology max #workers (explicitly) and max parallelism (implicitly).
• Define topology step parallelism explicitly. Max parallelism = Σ(step parallelism).
• Trident merges multiple steps into the same thread/node. Last parallelism statement is the
effective parallelism.
• Repartition operations define step merging boundaries.
• Repartition operations (shuffle, partitionBy, broadcast, etc.) imply network transfer and are
expensive. In some cases they are disastrous to performance!
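The two formulas on this slide can be checked with a few lines of arithmetic. The sketch below uses a hypothetical topology: the step names and all numbers are illustrative only, and it ignores Trident's merging of steps.

```python
def max_parallelism(step_parallelism):
    """Max parallelism = sum of the per-step parallelism settings."""
    return sum(step_parallelism.values())


def thread_capacity(workers, cpu_threads_per_worker):
    """Upper bound from the slide: CPU threads/worker x workers."""
    return workers * cpu_threads_per_worker


def fits(step_parallelism, workers, cpu_threads_per_worker):
    """True if the requested parallelism does not exceed the thread capacity."""
    return max_parallelism(step_parallelism) <= thread_capacity(
        workers, cpu_threads_per_worker)


# Hypothetical topology: names and numbers are assumptions for illustration.
steps = {"kafka-spout": 4, "parse": 8, "aggregate": 8, "persist": 4}
requested = max_parallelism(steps)
capacity = thread_capacity(workers=3, cpu_threads_per_worker=8)
```

With these numbers the topology requests 24 threads and three 8-thread workers provide exactly 24, so dropping to two workers would oversubscribe the CPUs.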
STORM :: PERFORMANCE TUNING
• Spouts must match upstream parallelism: one spout per Kafka partition.
• Little’s Law: Batch Size = Throughput x Latency
• Adjust batch size: (Kafka Partitions) x (Kafka fetch size)
  • Larger batch size = {higher throughput, higher latency}
  • Increase batch size gradually until latency starts increasing sharply.
• Identify the slowest stage (I/O or CPU bound):
  • You can’t have better throughput than the throughput of your slowest stage.
  • You can’t have better latency than the sum of individual latencies.
  • If CPU bound, increase parallelism. If I/O bound, increase downstream (i.e. storage) capacity.
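As a worked example of the rules above, the following sketch computes the pipeline bounds and the batch size implied by Little's Law. The stage figures are made-up numbers for a hypothetical spout, aggregation and persistence pipeline, not measurements.

```python
def batch_size(throughput_per_sec, latency_sec):
    """Little's Law: items in flight = throughput x latency."""
    return throughput_per_sec * latency_sec


def pipeline_bounds(stages):
    """stages: list of (throughput_per_sec, latency_sec), one pair per stage."""
    throughput = min(t for t, _ in stages)       # bounded by the slowest stage
    latency = sum(lat for _, lat in stages)      # individual latencies add up
    return throughput, latency


# Hypothetical figures: (messages/sec, seconds) for spout, aggregate, persist.
stages = [(50_000, 0.125), (20_000, 0.25), (30_000, 0.125)]
throughput, latency = pipeline_bounds(stages)
needed_batch = batch_size(throughput, latency)
```

Here the aggregation stage caps throughput at 20,000 messages/sec, end-to-end latency is at least 0.5 s, and sustaining that rate requires about 10,000 messages in flight per batch.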
CASSANDRA :: THE GOOD
• Great write performance. Decent read performance.
• Write latency: 20μs-120μs
• 15K writes/sec on a single node
• Extremely stable. Very low record of data corruption.
• Decentralized setup: all cluster nodes have the same setup.
• Multi-datacenter setups.
• Configurable consistency of updates: ONE, QUORUM, ALL.
• TTL per cell (row & column).
• Detailed metrics: #operations, latencies, thread pools, memory, cache performance.
CASSANDRA :: THE “FEATURES”
• All partition keys must be set in queries.
• All primary key columns preceding one that is restricted in a query must also be restricted
in that query.
• TTL is not allowed on cells containing counters.
• NULL values are not supported on primary keys.
• Range queries can only be applied on the last column of the composite primary key that
appears in the query.
• Disjunction operator (OR) is not available. The IN keyword can be used in some cases instead.
• Row counting is a very expensive operation.
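The key-related restrictions above can be expressed as a small checker. This is a simplified model of the rules as stated on the slide, not of Cassandra's actual query planner. `validate_query` and the example table layout (partition key `sensor`, clustering columns `day`, `ts`) are hypothetical.

```python
def validate_query(partition_keys, clustering_keys, equals, range_on=None):
    """Return a list of rule violations for a query's WHERE clause.

    equals:   set of columns restricted with '='
    range_on: column restricted with a range operator (<, >, ...), if any
    """
    problems = []
    missing = [c for c in partition_keys if c not in equals]
    if missing:
        problems.append(f"partition key columns not set: {missing}")

    restricted = set(equals) | ({range_on} if range_on else set())
    used = [c in restricted for c in clustering_keys]
    # Restricted clustering columns must form a prefix of the clustering key.
    if any(later and not earlier for earlier, later in zip(used, used[1:])):
        problems.append("a clustering column is restricted but a preceding one is not")

    if range_on is not None:
        last_used = max((i for i, u in enumerate(used) if u), default=None)
        if range_on not in clustering_keys or clustering_keys.index(range_on) != last_used:
            problems.append("range restriction must be on the last restricted key column")
    return problems


ok = validate_query(["sensor"], ["day", "ts"], {"sensor", "day"}, range_on="ts")
no_partition_key = validate_query(["sensor"], ["day", "ts"], {"day"})
skipped_column = validate_query(["sensor"], ["day", "ts"], {"sensor"}, range_on="ts")
range_not_last = validate_query(["sensor"], ["day", "ts"], {"sensor", "ts"}, range_on="day")
```

Only the first query passes: it sets the partition key, restricts `day` before `ts`, and applies the range to the last restricted column.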
CASSANDRA :: PERFORMANCE
• Design the schema around the partition key.
• Keep each partition small (no more than a few hundred entries), as reading will fetch the whole partition.
• Leverage Key cache
• Avoid making time fragments part of the partition key as this will direct all activity to the node that is the
partition owner at a given date.
• Query/Update Plan:
  • Avoid range queries and the IN operator, as they require contacting multiple nodes and assembling the
  results at the coordinator node.
  • Use prepared statements to avoid repeated statement parsing.
  • Prefer async writes combined with a max pending statements threshold.
  • Best performance comes from batches containing statements with the same partition key.
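One way to combine async writes with a cap on pending statements is a semaphore around the submit call. The sketch below is driver-agnostic: `BoundedAsyncWriter` is a hypothetical helper and `fake_write` stands in for an asynchronous driver call, so no Cassandra driver API is used or implied.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedAsyncWriter:
    """Submits writes asynchronously but blocks once too many are pending."""

    def __init__(self, write_fn, max_pending, pool_size=4):
        self._write_fn = write_fn
        self._slots = threading.BoundedSemaphore(max_pending)
        self._pool = ThreadPoolExecutor(max_workers=pool_size)

    def submit(self, row):
        self._slots.acquire()  # back-pressure: wait for a free slot
        future = self._pool.submit(self._write_fn, row)
        # Free the slot when the write completes, whether it succeeded or failed.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def close(self):
        self._pool.shutdown(wait=True)


stored = []
lock = threading.Lock()


def fake_write(row):
    """Stand-in for an asynchronous driver call; appends to an in-memory list."""
    with lock:
        stored.append(row)
    return row


writer = BoundedAsyncWriter(fake_write, max_pending=8)
futures = [writer.submit(i) for i in range(100)]
writer.close()
results = [f.result() for f in futures]
```

The threshold keeps a fast producer from queueing an unbounded number of statements when the storage tier slows down; the producer simply blocks in `submit` until earlier writes complete.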
CASSANDRA :: CLUSTER
• One or more data nodes are also “seed” nodes acting as the membership gatekeepers.
• Table sharding across the cluster based on the partition key hash (token).
• Table replication according to replication factor (RF). Configurable per keyspace (database).
• The Java driver has several load balancing approaches:
  • Token-aware: sends each statement to the node that actually will store it. Random selection
  amongst the nodes for a given replica set.
  • Latency-aware: sends each statement to the node with the fastest response.
  • Round-robin & custom load balancers supported.
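Token-based sharding and token-aware routing can be sketched with a toy hash ring. Several simplifications are assumed: one token per node, MD5 as the hash (chosen only because it is in the standard library; it is not claimed to be Cassandra's partitioner), and replicas taken as the next nodes clockwise on the ring. `Ring` and `pick_coordinator` are hypothetical names, not driver classes.

```python
import bisect
import hashlib
import random


def token(partition_key):
    """Hash of the partition key, truncated to 64 bits (illustrative choice)."""
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, node_tokens, replication_factor):
        # node_tokens: {node name: token owned by that node}
        self.ring = sorted((t, n) for n, t in node_tokens.items())
        self.rf = replication_factor

    def replicas(self, partition_key):
        """First node whose token is >= the key's token, plus the next RF-1 nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect_left(tokens, token(partition_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


def pick_coordinator(ring, partition_key, rng=random):
    """Token-aware choice: a random node from the key's replica set."""
    return rng.choice(ring.replicas(partition_key))


nodes = {f"node{i}": i * (2**64 // 4) for i in range(4)}
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("sensor-42")
coordinator = pick_coordinator(ring, "sensor-42")
```

Sending the statement to one of `owners` avoids an extra network hop, since the coordinator already holds a replica of the partition.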
FAILURE SCENARIOS
• Kafka
  • N-way replication: tolerates N-1 node failures.
  • Clients dynamically reconfigured if accessing through Zookeeper.
• Storm
  • For cluster size N, tolerates X supervisor failures provided the remaining (N-X) nodes have memory
  to accommodate X JVM worker processes.
  • Incomplete batches replayed: At-least-once semantics.
• Cassandra
  • N-way replication: tolerates N-1 node failures if using ONE consistency, ⌊(N-1)/2⌋ if using QUORUM consistency.
  • Clients require a list of all cluster nodes.
• Zookeeper
  • Majority voting is required: At most F failures in a cluster with 2F+1 nodes. Leader re-election very fast (200ms).
  • Sizes bigger than 3-5 not recommended due to decreasing write performance.
  • If majority voting is lost, Storm will stop. Kafka will fail to commit client offsets. If majority is regained, Storm will resume. Kafka
  brokers will resume in most cases.
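The failure-tolerance arithmetic for replication and majority voting can be written out as a short sketch. The function names are hypothetical; the formulas are the standard ones: a quorum is a strict majority, and the number of tolerated failures is the number of replicas minus the acknowledgements required.

```python
def quorum(replicas):
    """Smallest strict majority of `replicas`."""
    return replicas // 2 + 1


def tolerated_failures(replicas, required_acks):
    """Replicas that may be down while `required_acks` can still respond."""
    return replicas - required_acks


def cassandra_tolerance(rf):
    """Tolerated replica failures per consistency level for replication factor rf."""
    return {
        "ONE": tolerated_failures(rf, 1),
        "QUORUM": tolerated_failures(rf, quorum(rf)),
        "ALL": tolerated_failures(rf, rf),
    }


def zookeeper_tolerance(ensemble_size):
    """Majority voting: an ensemble of 2F+1 nodes survives F failures."""
    return tolerated_failures(ensemble_size, quorum(ensemble_size))


rf3 = cassandra_tolerance(3)
zk5 = zookeeper_tolerance(5)
```

Note that a 4-node Zookeeper ensemble tolerates only one failure, the same as a 3-node ensemble, which is why odd sizes are the usual choice.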

More Related Content

What's hot

Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
Chicago Hadoop Users Group
 
Spark vs storm
Spark vs stormSpark vs storm
Spark vs storm
Trong Ton
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Folio3 Software
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
DECK36
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
Eugene Dvorkin
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
T Jake Luciani
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
Humoyun Ahmedov
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
Lester Martin
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
Nati Shalom
 
How Spotify scales Apache Storm Pipelines
How Spotify scales Apache Storm PipelinesHow Spotify scales Apache Storm Pipelines
How Spotify scales Apache Storm Pipelines
Kinshuk Mishra
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
Md. Shamsur Rahim
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
DataWorks Summit/Hadoop Summit
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
Robert Evans
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Data Con LA
 

What's hot (19)

Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Spark vs storm
Spark vs stormSpark vs storm
Spark vs storm
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
How Spotify scales Apache Storm Pipelines
How Spotify scales Apache Storm PipelinesHow Spotify scales Apache Storm Pipelines
How Spotify scales Apache Storm Pipelines
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 

Viewers also liked

Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
Guido Schmutz
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
P. Taylor Goetz
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
P. Taylor Goetz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Michael Noll
 
Evernote
EvernoteEvernote
Evernote
Lee Wayne
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
StampedeCon
 
SOA & Big Data
SOA & Big DataSOA & Big Data
SOA & Big Data
Arnon Rotem-Gal-Oz
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strata
Patrick McFadin
 
Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
Demed L'Her
 
10 Productivity Tips From Hootsuite & Evernote
10 Productivity Tips From Hootsuite & Evernote10 Productivity Tips From Hootsuite & Evernote
10 Productivity Tips From Hootsuite & Evernote
Hootsuite
 
Actors and Threads
Actors and ThreadsActors and Threads
Actors and Threads
mperham
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka Streams
Johan Andrén
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
Andrea Iacono
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Martin Zapletal
 
a real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxxa real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at DevoxxNathan Bijnens
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
SingleStore
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 

Viewers also liked (20)

Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Evernote
EvernoteEvernote
Evernote
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
SOA & Big Data
SOA & Big DataSOA & Big Data
SOA & Big Data
 
Time series with apache cassandra strata
Time series with apache cassandra   strataTime series with apache cassandra   strata
Time series with apache cassandra strata
 
Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
 
10 Productivity Tips From Hootsuite & Evernote
10 Productivity Tips From Hootsuite & Evernote10 Productivity Tips From Hootsuite & Evernote
10 Productivity Tips From Hootsuite & Evernote
 
Actors and Threads
Actors and ThreadsActors and Threads
Actors and Threads
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka Streams
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
 
a real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxxa real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxx
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 

Similar to Real-Time Analytics with Kafka, Cassandra and Storm

Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
Instaclustr
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
DataStax
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
Shyam Raj
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
inside-BigData.com
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax Academy
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
Coburn Watson
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentials
Julien Anguenot
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
DataStax
 
Basics of JVM Tuning
Basics of JVM TuningBasics of JVM Tuning
Basics of JVM Tuning
Vladislav Gangan
 
Cassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentialsCassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentials
Julien Anguenot
 
Devops kc
Devops kcDevops kc
Devops kc
Philip Thompson
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Spark Summit
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
András Fehér
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
Piotr Pelczar
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Boris Yen
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
Jacky Chu
 
Riak add presentation
Riak add presentationRiak add presentation
Riak add presentation
Ilya Bogunov
 
EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)
Scott Mansfield
 
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalVigyan Jain
 

Similar to Real-Time Analytics with Kafka, Cassandra and Storm (20)

Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The Sequel
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentials
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Basics of JVM Tuning
Basics of JVM TuningBasics of JVM Tuning
Basics of JVM Tuning
 
Cassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentialsCassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentials
 
Devops kc
Devops kcDevops kc
Devops kc
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
 
Riak add presentation
Riak add presentationRiak add presentation
Riak add presentation
 
EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)
 
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
 

Recently uploaded

Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 


Real-Time Analytics with Kafka, Cassandra and Storm

  • 1. REAL-TIME ANALYTICS WITH KAFKA, CASSANDRA & STORM Dr. John Georgiadis Modio Computing
  • 2. USE CASES
      • Collecting/processing measurements from large sensor networks (e.g. weather data).
      • Aggregated processing of financial trading streams.
      • Customer activity monitoring for advertising purposes, fraud detection, etc.
      • Real-time security log processing.
  • 3. SOLUTION APPROACH
      • Real-time updates: employ streaming instead of batch analytics.
      • Apache Storm: large installation base. Streaming & micro-batch.
      • Apache Spark: uniform API for batch & micro-batch. Runs on top of YARN/HDFS. Micro-batch is less mature but catching up quickly.
      • Large data sets + time series + write-intensive + data expiration = Apache Cassandra.
  • 5. APACHE KAFKA
      • N nodes.
      • T topics.
      • Replication factor: defines high availability.
      • Partitions: they define the parallelism level. A single consumer (per consumer group) reads each partition.
      • Consumers discover cluster nodes through ZooKeeper.
      • Consumer partition state is just an integer: the partition offset.
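The last two points of the Kafka slide can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Kafka client API: `Partition`, `PartitionConsumer` and `assign_partitions` are hypothetical names introduced only for illustration. It shows that a consumer can resume from nothing more than a stored integer offset, and that each partition is handed to exactly one consumer.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy stand-in for a Kafka partition: an append-only list of messages."""
    log: list = field(default_factory=list)

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the message just appended


@dataclass
class PartitionConsumer:
    """The consumer's whole state for a partition is the next offset to read."""
    partition: Partition
    offset: int = 0

    def poll(self, max_messages):
        batch = self.partition.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)  # advancing the offset is the only state change
        return batch


def assign_partitions(num_partitions, consumers):
    """Round-robin assignment: every partition gets exactly one consumer."""
    return {p: consumers[p % len(consumers)] for p in range(num_partitions)}
```

A consumer that crashes can be replaced by a new `PartitionConsumer(partition, offset=saved_offset)`, which continues where the old one stopped.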
  • 6. STORM
      • Storm is a distributed computing platform. In Storm, a distributed computation is a directed graph of interconnected processors (a topology) that exchange messages.
      • Spouts: processors that inject messages into the topology.
      • Bolts: processors that process messages, including sending them to third parties (e.g. persistence).
      • Trident: high-level operations on message batches. Supports batch replay in case of failure. Translates to a graph of low-level spouts and bolts.
  • 7. STORM :: NIMBUS
      • A single controller (Nimbus) where topologies are submitted. Nimbus breaks topologies into tasks and forwards them to supervisors, which spawn one or more workers (processes) per task.
      • Nimbus redistributes tasks in case a supervisor fails.
      • Nimbus is not HA. If Nimbus fails, running topologies are not affected.
  • 8. STORM :: SUPERVISOR
      • 1 supervisor per host.
      • The Supervisor registers with ZK at startup and is thus discoverable by Nimbus.
      • The Supervisor spawns Worker JVMs: one process per topology.
      • The JAR submitted to Nimbus is copied to the Worker classpath.
      • When a Supervisor dies, all its Worker tasks are migrated to the remaining Supervisors.
  • 9. STORM :: TRIDENT
      • When to use micro-batch (aka Trident) instead of streaming:
      • Millisecond latency is not required. Typical Trident latency threshold: 500 ms.
      • Allows batch-mode persistence operations.
      • High-level abstractions: partitionBy, partitionAggregate, stateQuery, partitionPersist.
      • A batch processing timeout/exception will cause a replay of the batch, provided the Spout supports replays (Kafka does): at-least-once semantics.
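The replay behaviour described on the Trident slide can be sketched without Storm itself. The code below is a simplified simulation under stated assumptions: `run_with_replay` and `make_flaky_sink` are hypothetical helpers, not Trident APIs, and a `RuntimeError` stands in for a batch timeout. It shows why replaying a failed batch yields at-least-once rather than exactly-once delivery: messages written before the failure are written again on replay.

```python
def run_with_replay(batches, process, max_attempts=3):
    """Process each batch in order; replay the same batch when it fails."""
    attempts = []
    for batch_id, batch in enumerate(batches):
        for attempt in range(1, max_attempts + 1):
            attempts.append((batch_id, attempt))
            try:
                process(batch)
                break  # batch succeeded, move on to the next one
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up after too many replays
    return attempts


def make_flaky_sink(failures):
    """Return (sink, persist); persist fails `failures` times mid-batch."""
    sink = []
    remaining = [failures]

    def persist(batch):
        for index, message in enumerate(batch):
            sink.append(message)  # the write lands before the failure
            if index == 0 and remaining[0] > 0:
                remaining[0] -= 1
                raise RuntimeError("simulated batch timeout")

    return sink, persist
```

With `sink, persist = make_flaky_sink(1)` and `run_with_replay([["a", "b"], ["c"]], persist)`, the first message of the first batch appears twice in `sink`, so downstream persistence has to tolerate duplicates (for example through idempotent writes).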
  • 10. STORM :: PARALLELISM
      • Parallelism = number of threads executing a topology cluster-wide.
      • Parallelism <= (CPU threads per worker) x (workers).
      • Define per-topology max #workers (explicitly) and max parallelism (implicitly).
      • Explicitly define the parallelism of each topology step. Max parallelism = Σ(step parallelism).
      • Trident merges multiple steps into the same thread/node. The last parallelism statement is the effective parallelism.
      • Repartition operations define step-merging boundaries.
      • Repartition operations (shuffle, partitionBy, broadcast, etc.) imply network transfer and are expensive. In some cases they are disastrous to performance!
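The two formulas on the parallelism slide amount to simple arithmetic. The following sketch uses hypothetical step names and figures; it only checks whether the summed step parallelism fits within the thread budget of the workers.

```python
def max_parallelism(step_parallelism):
    """Max parallelism = sum of the parallelism declared for each step."""
    return sum(step_parallelism.values())


def thread_capacity(cpu_threads_per_worker, workers):
    """Upper bound from the slide: CPU threads per worker x workers."""
    return cpu_threads_per_worker * workers


def fits(step_parallelism, cpu_threads_per_worker, workers):
    """True when the topology does not ask for more threads than available."""
    return max_parallelism(step_parallelism) <= thread_capacity(
        cpu_threads_per_worker, workers
    )


# Hypothetical topology: step names and values are illustrative only.
steps = {"kafka-spout": 8, "aggregate": 16, "persist": 8}
```

For these illustrative numbers `max_parallelism(steps)` is 32, which fits on 4 workers with 8 CPU threads each but not on 2 such workers.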
  • 11. STORM :: PERFORMANCE TUNING
      • Spouts must match upstream parallelism: one spout per Kafka partition.
      • Little’s Law: Batch Size = Throughput x Latency.
      • Adjust batch size: (Kafka partitions) x (Kafka fetch size). Larger batch size = {higher throughput, higher latency}. Increase the batch size gradually until latency starts increasing sharply.
      • Identify the slowest stage (I/O or CPU bound):
          • You can’t have better throughput than the throughput of your slowest stage.
          • You can’t have better latency than the sum of individual latencies.
      • If CPU bound, increase parallelism. If I/O bound, increase downstream (i.e. storage) capacity.
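The tuning rules above can be worked through numerically. This is a small sketch with made-up figures: the stage names, throughputs and latencies are assumptions for illustration, not measurements from the talk.

```python
def batch_size(throughput_per_sec, latency_ms):
    """Little's Law: batch size = throughput x latency."""
    return throughput_per_sec * latency_ms / 1000


def batch_bytes(kafka_partitions, fetch_size_bytes):
    """Batch size as configured: Kafka partitions x Kafka fetch size."""
    return kafka_partitions * fetch_size_bytes


def pipeline_bounds(stages):
    """stages: list of (name, throughput_per_sec, latency_ms).

    Returns the bottleneck stage name, the best achievable throughput
    (that of the slowest stage) and the best achievable latency
    (the sum of the individual latencies).
    """
    slowest = min(stages, key=lambda stage: stage[1])
    total_latency_ms = sum(stage[2] for stage in stages)
    return slowest[0], slowest[1], total_latency_ms


# Illustrative pipeline; all numbers are hypothetical.
stages = [
    ("spout", 50_000, 10),
    ("aggregate", 20_000, 50),
    ("persist", 8_000, 200),
]
```

Here the `persist` stage caps throughput at 8,000 messages/sec and the end-to-end latency cannot drop below 260 ms, so adding CPU parallelism to `aggregate` would not help; by the slide's rule, storage capacity is what needs to grow.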
  • 12. CASSANDRA :: THE GOOD
      • Great write performance. Decent read performance.
          • Write latency: 20μs-120μs.
          • 15K writes/sec on a single node.
      • Extremely stable. Very few recorded cases of data corruption.
      • Decentralized setup: all cluster nodes have the same setup.
      • Multi-datacenter setups.
      • Configurable consistency of updates: ONE, QUORUM, ALL.
      • TTL per cell (row & column).
      • Detailed metrics: #operations, latencies, thread pools, memory, cache performance.
  • 13. CASSANDRA :: THE “FEATURES”
      • All partition key columns must be set in queries.
      • All primary key columns preceding a primary key column that is given a value in a query must also be given a value.
      • TTL is not allowed on cells containing counters.
      • NULL values are not supported on primary keys.
      • Range queries can only be applied to the last column of the composite primary key that appears in the query.
      • The disjunction operator (OR) is not available. The IN keyword can be used instead in some cases.
      • Row counting is a very expensive operation.
  • 14. CASSANDRA :: PERFORMANCE
      • Design the schema around the partition key.
      • Keep each partition small (no more than a few hundred entries), as reading will fetch the whole partition.
      • Leverage the key cache.
      • Avoid making time fragments part of the partition key, as this will direct all activity to the node that is the partition owner at a given date.
      • Query/Update Plan:
          • Avoid range queries and the IN operator, as they require contacting multiple nodes and assembling the results at the coordinator node.
          • Use prepared statements to avoid repeated statement parsing.
          • Prefer async writes combined with a max pending statements threshold.
          • Batches perform best when all their statements share the same partition key.
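The advice to combine async writes with a maximum number of pending statements can be sketched generically. The code below uses plain `asyncio` and deliberately avoids any Cassandra driver API: `write_all` and `FakeStore` are hypothetical names, and `FakeStore.write` merely stands in for an asynchronous driver call. A semaphore enforces the pending-statements threshold.

```python
import asyncio


async def write_all(rows, write_row, max_pending):
    """Issue writes concurrently, never exceeding max_pending in flight."""
    semaphore = asyncio.Semaphore(max_pending)

    async def guarded(row):
        async with semaphore:  # blocks while max_pending writes are pending
            await write_row(row)

    await asyncio.gather(*(guarded(row) for row in rows))


class FakeStore:
    """Stand-in for a database session; records peak concurrency."""

    def __init__(self):
        self.in_flight = 0
        self.peak_in_flight = 0
        self.rows = []

    async def write(self, row):
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        await asyncio.sleep(0)  # yield control, as a network call would
        self.rows.append(row)
        self.in_flight -= 1
```

Calling `asyncio.run(write_all(range(20), store.write, max_pending=4))` on a `FakeStore` writes all 20 rows while `store.peak_in_flight` never exceeds 4. Without the threshold, a fast producer could queue an unbounded number of statements and exhaust client memory or overload the cluster.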
  • 15. CASSANDRA :: CLUSTER
      • One or more data nodes are also “seed” nodes, acting as the membership gatekeepers.
      • Tables are sharded across the cluster based on the partition key hash (token).
      • Tables are replicated according to the replication factor (RF), which is configurable per keyspace (database).
      • The Java driver has several load balancing approaches:
          • Token-aware: sends each statement to a node that will actually store it. Random selection amongst the nodes of a given replica set.
          • Latency-aware: sends each statement to the node with the fastest response.
          • Round-robin & custom load balancers are supported.
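Token-based sharding and token-aware routing can be illustrated with a toy hash ring. This is a simplified sketch, not Cassandra's implementation: the MD5-based `token` function is a stand-in for the real partitioner, each node owns a single token derived from its name, and `Ring` and `pick_coordinator` are hypothetical names. Replicas are taken as the owning node plus the next nodes clockwise on the ring.

```python
import bisect
import hashlib
import random


def token(key):
    """Stand-in hash (assumption): first 8 bytes of MD5 as an integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds cluster size")
        self.rf = replication_factor
        # One token per node, sorted to form the ring.
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key):
        """Owner = first node whose token >= key token, then RF-1 successors."""
        size = len(self.entries)
        start = bisect.bisect_left(self.tokens, token(partition_key)) % size
        return [self.entries[(start + i) % size][1] for i in range(self.rf)]


def pick_coordinator(ring, partition_key, rng=random):
    """Token-aware routing: choose randomly among the key's replica set."""
    return rng.choice(ring.replicas(partition_key))
```

Because the coordinator is always one of the replicas for the key, the statement avoids an extra network hop to a node that does not hold the data.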
  • 16. FAILURE SCENARIOS
      • Kafka
          • N-way replication: tolerates N-1 node failures.
          • Clients are dynamically reconfigured if accessing through ZooKeeper.
      • Storm
          • For cluster size N, tolerates X supervisor failures provided the remaining (N-X) nodes have memory to accommodate the X JVM worker processes.
          • Incomplete batches are replayed: at-least-once semantics.
      • Cassandra
          • N-way replication: N-1 node failures if using ONE consistency, if using QUORUM consistency.
          • Clients require a list of all cluster nodes.
      • ZooKeeper
          • Majority voting is required: at most F failures in a cluster with 2F+1 nodes. Leader re-election is very fast (200 ms).
          • Ensemble sizes bigger than 3-5 are not recommended due to decreasing write performance.
          • If majority voting is lost, Storm will stop and Kafka will fail to commit client offsets. If the majority is regained, Storm will resume. Kafka brokers will resume in most cases.
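The failure limits listed above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names. The Kafka and ZooKeeper figures follow the slide; the QUORUM figure is an assumption based on the usual quorum definition (a majority of the replicas), since the slide's own value is not legible in the transcript.

```python
def kafka_tolerated_failures(replication_factor):
    """N-way replication survives N-1 node failures."""
    return replication_factor - 1


def zookeeper_tolerated_failures(ensemble_size):
    """Majority voting: an ensemble of 2F+1 nodes survives F failures."""
    return (ensemble_size - 1) // 2


def quorum_size(replication_factor):
    """Assumed definition: a quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def cassandra_tolerated_failures(replication_factor, consistency):
    """Replica failures that still leave enough replicas to answer."""
    required = {
        "ONE": 1,
        "QUORUM": quorum_size(replication_factor),
        "ALL": replication_factor,
    }[consistency]
    return replication_factor - required
```

For example, a 4-node ZooKeeper ensemble tolerates no more failures than a 3-node one (`zookeeper_tolerated_failures` returns 1 for both), which is one reason odd ensemble sizes are the usual choice.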