Spark Streaming high-level overview
An abstraction over core Spark that provides stream processing functionality
Core Spark
• A general, in-memory data processing engine
• Batch oriented
• Main abstraction is the RDD (Resilient Distributed Dataset)
  • Represents a distributed collection
  • Doesn’t store the data itself
  • It’s the API for defining processing steps
stream
  .map(this.tryParseRawLine _)   // parse each raw input line
  .filter(_.isSuccess)           // keep only successful parses
  .map(_.value)                  // unwrap the parsed value
  .map(mapWithKey(_))            // key each record for aggregation
  .reduceByKey {
    /* reduce logic */
  }
  .map(_.toString)
• Translates an expression tree into a distributed data processing application
• Serializes functions and their enclosed scope and sends them to executors
  • Objects in the enclosed scope must be serializable, or instantiated within the function
  • Updates to shared objects won’t be reflected across the cluster
  • Avoid referencing large objects from closures; use Spark’s ‘broadcast variables’ instead (sketched below)
• Operations consist of transformations (map, filter, and the like) and actions that trigger execution; wide operations such as group by and reduce shuffle data across nodes
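To make the broadcast-variable point concrete, here is a minimal Scala sketch; the lookup table, its contents, and the object name are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

    // A large lookup table living on the driver (hypothetical data).
    val countryNames = Map("IL" -> "Israel", "US" -> "United States")

    // Captured directly in a closure, this map is re-serialized with every task,
    // and any mutation on the driver is never seen by the executors.
    // Broadcasting ships one read-only copy to each executor instead.
    val countryNamesBc = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("IL", "US", "IL"))
    val resolved = codes.map(c => countryNamesBc.value.getOrElse(c, "unknown"))
    resolved.collect().foreach(println)

    sc.stop()
  }
}

The broadcast value is read-only on the executors, which is exactly why driver-side updates to ordinary shared objects never propagate across the cluster.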
Kafka
• A pub/sub messaging system
• Supports topics – named queues
• Supports replaying messages from any offset in the queue (sketched below)
• Scalable
  • Data is partitioned and replicated
• Durable
  • Messages are persisted and replicated – this adds to latency
  • Master/slave replication at the partition level
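As a rough sketch of the replay capability using the plain Kafka consumer API; the topic name, partition, offsets, and connection settings are placeholders, and this assumes a recent kafka-clients version plus Scala 2.13 collection converters:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

object ReplayFromOffset {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "replay-demo")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    val partition = new TopicPartition("raw-lines", 0)   // hypothetical topic

    // Messages are persisted, so a consumer can rewind and re-read them.
    consumer.assign(Collections.singletonList(partition))
    consumer.seek(partition, 0L)                          // replay from the beginning

    consumer.poll(Duration.ofSeconds(1)).asScala.foreach { record =>
      println(s"offset=${record.offset} value=${record.value}")
    }
    consumer.close()
  }
}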
Back to Spark Streaming
• API almost identical to core Spark
• Micro-batches – as data comes in, it is buffered for a given interval and then handed to core Spark for processing (sketched below)
  • Buffered data is handed to core Spark as RDDs
  • Each interval produces one RDD
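A minimal sketch of what the micro-batch setup looks like in code; the 5-second interval, socket source, host, and port are arbitrary choices for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-example")
    // Every 5 seconds the buffered input becomes one RDD handed to core Spark.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)  // any supported input source works

    // Same API as core Spark, applied to each micro-batch RDD in turn.
    lines.map(_.length).reduce(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}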
Data of the current batch is processed in parallel with buffering the data for the next batch
A guideline for a Spark Streaming application
A micro-batch’s processing time must be less than the batch interval
• It may sound trivial, but it is the key to avoiding bottlenecks and performance degradation
• For example, with a 10-second batch interval, batches that consistently take 12 seconds to process pile up, and the scheduling delay grows without bound
Performance tuning at large
• Parallelizing data consumption from the input source (sketched below)
  • In the case of Kafka, this depends on the topic’s partitions (create multiple Kafka stream consumers)
• Parallelizing data processing (Spark partitions) should be balanced with the total number of cores and consumers
  • Each Kafka consumer uses one core
  • Cores available for processing = total #cores - #consumers
• Serialization – avoid default Java object serialization; Kryo is recommended
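A hedged sketch of those tuning knobs together, assuming the older receiver-based KafkaUtils.createStream API; the topic name, Zookeeper address, consumer count, and partition counts are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TunedKafkaIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tuned-kafka-ingest")
      // Prefer Kryo over default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val ssc = new StreamingContext(conf, Seconds(10))

    // One receiver (and therefore one core) per stream; match the topic's partition count.
    val numConsumers = 4
    val kafkaStreams = (1 to numConsumers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "tuning-demo", Map("raw-lines" -> 1))
    }

    // Union the parallel streams, then repartition for the cores left for processing
    // (e.g. 16 total cores - 4 receivers = 12 processing cores).
    val unified = ssc.union(kafkaStreams).repartition(12)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}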
