Spark Streaming high-level overview
An abstraction over core Spark that provides stream processing functionality
Core Spark
• A general, in-memory data processing engine
• Batch oriented
• Main abstraction is the RDD (Resilient Distributed Dataset)
  • Represents a distributed collection
  • Doesn’t store the data itself
  • It’s the API for defining processing steps
stream
  .map(this.tryParseRawLine _)   // parse each raw input line
  .filter(_.isSuccess)           // keep only successful parses
  .map(_.value)                  // unwrap the parsed value
  .map(mapWithKey(_))            // key each record for aggregation
  .reduceByKey {
    /* reduce logic */
  }
  .map(_.toString)
• Translates an expression tree into a distributed data processing application
• Serializes functions and their enclosed scope and sends them to executors
  • Objects in the enclosed scope must be serializable, or instantiated within the function
  • Updates to shared objects won’t be reflected across the cluster
  • Avoid referencing large objects from closures; use Spark’s ‘broadcast variables’ instead (sketched below)
• Operations consist of transformations (map, filter, and the like) and actions that trigger execution; wide operations such as group by and reduce shuffle data across nodes
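To make the broadcast-variable point concrete, here is a minimal Scala sketch; the lookup table, its contents, and the object name are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

    // A large lookup table living on the driver (hypothetical data).
    val countryNames = Map("IL" -> "Israel", "US" -> "United States")

    // Captured directly in a closure, this map is re-serialized with every task,
    // and any mutation on the driver is never seen by the executors.
    // Broadcasting ships one read-only copy to each executor instead.
    val countryNamesBc = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("IL", "US", "IL"))
    val resolved = codes.map(c => countryNamesBc.value.getOrElse(c, "unknown"))
    resolved.collect().foreach(println)

    sc.stop()
  }
}

The broadcast value is read-only on the executors, which is exactly why driver-side updates to ordinary shared objects never propagate across the cluster.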
Kafka
• A pub/sub messaging system
• Supports topics – named queues
• Supports replaying messages from any offset in the queue (sketched below)
• Scalable
  • Data is partitioned and replicated
• Durable
  • Messages are persisted and replicated – this adds to latency
  • Master/slave replication at the partition level
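As a rough sketch of the replay capability using the plain Kafka consumer API; the topic name, partition, offsets, and connection settings are placeholders, and this assumes a recent kafka-clients version plus Scala 2.13 collection converters:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

object ReplayFromOffset {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "replay-demo")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    val partition = new TopicPartition("raw-lines", 0)   // hypothetical topic

    // Messages are persisted, so a consumer can rewind and re-read them.
    consumer.assign(Collections.singletonList(partition))
    consumer.seek(partition, 0L)                          // replay from the beginning

    consumer.poll(Duration.ofSeconds(1)).asScala.foreach { record =>
      println(s"offset=${record.offset} value=${record.value}")
    }
    consumer.close()
  }
}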
Back to Spark Streaming
• API almost identical to core Spark
• Micro-batches – as data comes in, it is buffered for a given interval and then handed to core Spark for processing (sketched below)
  • Buffered data is handed to core Spark as RDDs
  • Each interval produces one RDD
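A minimal sketch of what the micro-batch setup looks like in code; the 5-second interval, socket source, host, and port are arbitrary choices for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-example")
    // Every 5 seconds the buffered input becomes one RDD handed to core Spark.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)  // any supported input source works

    // Same API as core Spark, applied to each micro-batch RDD in turn.
    lines.map(_.length).reduce(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}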
Data of the current batch is processed in parallel with buffering the data for the next batch
A guideline for a Spark Streaming application
A micro-batch’s processing time must be less than the batch interval
• It may sound trivial, but it is the key to avoiding bottlenecks and performance degradation
• For example, with a 10-second batch interval, batches that consistently take 12 seconds to process pile up, and the scheduling delay grows without bound
Performance tuning at large
• Parallelizing data consumption from the input source (sketched below)
  • In the case of Kafka, this depends on the topic’s partitions (create multiple Kafka stream consumers)
• Parallelizing data processing (Spark partitions) should be balanced with the total number of cores and consumers
  • Each Kafka consumer uses one core
  • Cores available for processing = total #cores - #consumers
• Serialization – avoid default Java object serialization; Kryo is recommended
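A hedged sketch of those tuning knobs together, assuming the older receiver-based KafkaUtils.createStream API; the topic name, Zookeeper address, consumer count, and partition counts are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TunedKafkaIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tuned-kafka-ingest")
      // Prefer Kryo over default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val ssc = new StreamingContext(conf, Seconds(10))

    // One receiver (and therefore one core) per stream; match the topic's partition count.
    val numConsumers = 4
    val kafkaStreams = (1 to numConsumers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "tuning-demo", Map("raw-lines" -> 1))
    }

    // Union the parallel streams, then repartition for the cores left for processing
    // (e.g. 16 total cores - 4 receivers = 12 processing cores).
    val unified = ssc.union(kafkaStreams).repartition(12)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}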
