Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture
1. Spark Streaming and Friends
Chris Fregly
Global Big Data Conference
Sept 2014
2. Who am I?
Former Netflix’er:
netflix.github.io
Spark Contributor:
github.com/apache/spark
Founder:
fluxcapacitor.com
Author:
effectivespark.com
sparkinaction.com
6. Agenda
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
7. Spark Overview (1/2)
• Berkeley AMPLab ~2009
• Part of Berkeley Data Analytics Stack (BDAS,
aka “badass”)
8. Spark Overview (2/2)
• Based on Microsoft Dryad paper ~2007
• Written in Scala
• Supports Java, Python, SQL, and R
• In-memory when possible, not required
• Improved efficiency over MapReduce
– 100x in-memory, 2-10x on-disk
• Compatible with Hadoop
– File formats, SerDes, and UDFs
12. Unified Benefits
• Advancements in higher-level libraries
pushed down into core and vice-versa
• Examples
– Spark Streaming: GC and memory
management improvements
– Spark GraphX: IndexedRDD for random,
hashed access within a partition versus
scanning entire partition
14. Resilient Distributed Dataset (RDD)
• Core Spark abstraction
• Represents partitions
across the cluster nodes
• Enables parallel processing
on data sets
• Partitions can be in-memory or
on-disk
• Immutable, recomputable,
fault tolerant
• Contains transformation lineage on data set
16. Spark API Overview
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python,
SQL, and R (mostly)
• Unified API across all libraries
• Operations = Transformations + Actions
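The transformation/action split above can be sketched with the core RDD API. A minimal local-mode example (the input path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Transformations build lineage lazily; nothing runs until an action.
val sc = new SparkContext(
  new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

val lines  = sc.textFile("access.log")              // RDD from storage (placeholder path)
val errors = lines.filter(_.contains("ERROR"))      // transformation: lazy
val byHost = errors.map(line => (line.split(" ")(0), 1))
                   .reduceByKey(_ + _)              // transformation: still lazy
val counts = byHost.collect()                       // action: triggers DAG execution
```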
24. Master High Availability
• Multiple Master Nodes
• ZooKeeper maintains current Master
• Existing applications and workers will be
notified of new Master election
• New applications and workers need to
explicitly specify current Master
• Alternatives (Not recommended)
– Local filesystem
– NFS Mount
26. Spark Streaming Overview
• Low latency, high throughput, fault-tolerance
(mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis,
Socket, File, etc.
• Graceful shutdown, in-flight message
draining
• Uses Spark Core, DAG Execution Model,
and Fault Tolerance
27. Spark Streaming Use Cases
• ETL on streaming data during ingestion
• Anomaly, malware, and fraud detection
• Operational dashboards
• Lambda architecture
– Unified batch and streaming
– e.g., different machine learning models for different time frames
• Predictive maintenance
– Sensors
• NLP analysis
– Twitter firehose
28. Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
• Fault tolerance using DStream/RDD lineage
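The canonical socket word count illustrates the micro-batch model; host and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 2-second micro-batch of input becomes one RDD inside the DStream.
val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

val lines  = ssc.socketTextStream("localhost", 9999)  // placeholder source
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)  // same operator names as the RDD API
counts.print()

ssc.start()
ssc.awaitTermination()
```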
30. Spark Streaming API Overview
• Rich, expressive API similar to core
• Operations
– Transformations
– Actions
• Window and State Operations
• Requires checkpointing to truncate long-running DStream lineage
• Register DStream as a Spark SQL table
for querying!
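A sketch of a window operation plus SQL registration, continuing a word-count DStream (`ssc`, a `(word, count)` DStream named `counts`, and a `sqlContext` are assumed to exist; the checkpoint path is a placeholder, and the table API is the Spark 1.x-era one):

```scala
import org.apache.spark.streaming.Seconds

// Stateful/window operations require a checkpoint directory so the
// growing DStream lineage can be periodically truncated.
ssc.checkpoint("checkpoints")

// Counts over the last 60 seconds, recomputed every 10 seconds.
val windowed = counts.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

// Register each micro-batch as a Spark SQL table and query it.
windowed.foreachRDD { rdd =>
  sqlContext.createSchemaRDD(rdd).registerTempTable("window_counts")
  sqlContext.sql(
    "SELECT * FROM window_counts ORDER BY _2 DESC LIMIT 10").collect()
}
```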
40. Spark Streaming + Kinesis Architecture
[Figure: Multiple Kinesis Producers write to a Kinesis Stream split into Shards 1-3. A Spark Streaming application consumes the stream through Kinesis Receiver DStreams; each receiver embeds the Kinesis Client Library (KCL) and runs one Record Processor Thread per shard it owns.]
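Wiring up a KCL-backed receiver can be sketched with the `spark-streaming-kinesis-asl` helper from the Spark 1.1 era (`ssc` is assumed to exist; the stream name is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// One receiver per call; create several and union() them to consume
// multiple shards in parallel.
val stream = KinesisUtils.createStream(
  ssc,
  "myKinesisStream",                          // stream name (placeholder)
  "https://kinesis.us-east-1.amazonaws.com",  // region endpoint
  Seconds(2),                                 // KCL checkpoint interval (DynamoDB)
  InitialPositionInStream.LATEST,
  StorageLevel.MEMORY_AND_DISK_2)             // replicate to 2 nodes on ingest
```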
41. Throughput and Pricing
[Figure: A Kinesis Producer writes to a Kinesis Stream shard; a Spark Kinesis Receiver in the Spark Streaming application consumes it with < 10 second delay.]
• Ingest: 1 MB/sec per shard, up to 1000 PUTs/sec (50 KB max per PUT)
• Egress: 2 MB/sec per shard
• Shard cost: $0.36 per day per shard
• PUT cost: $2.50 per day per shard (at the full PUT rate)
• Network transfer cost: free within region!
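The per-shard figures above work out as follows (a worked example using the Sept-2014 prices quoted on the slide):

```scala
// Cost of one fully utilized shard at the Sept-2014 prices above.
val shardCostPerDay = 0.36                  // $ per shard-day
val putCostPerDay   = 2.50                  // $ per shard-day at the max PUT rate
val putsPerDay      = 1000L * 60 * 60 * 24  // 86,400,000 PUTs/day per shard
val totalPerDay     = shardCostPerDay + putCostPerDay  // $2.86 per shard-day
```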
45. Streaming Receiver Failure
• Use a backup receiver
• Use multiple receivers pulling from multiple
shards
– Use a checkpoint-enabled, sharded streaming source (e.g., Kafka or Kinesis)
• Data is replicated to 2 nodes immediately
upon ingestion
• Possible data loss
• Possible at-least-once delivery (duplicate records)
• Use buffered, replayable sources (e.g., Kafka or Kinesis)
46. Streaming Driver Failure
• Use a backup Driver
– Use DStream metadata checkpoint info to
recover
• Single point of failure – interrupts stream
processing
• Streaming Driver is a long-running Spark
application
– Schedules long-running stream receivers
• State and Window RDD checkpoints help
avoid data loss (mostly)
48. Types of Checkpoints
Spark
1. Spark checkpointing of StreamingContext
DStreams and metadata
2. Lineage of state and window DStream
operations
Kinesis
3. Kinesis Client Library (KCL) checkpoints
current position within shard
– Checkpoint info is stored in DynamoDB per
Kinesis application keyed by shard
52. Tuning
• Batch interval
– High: reduces the overhead of submitting new tasks for each batch
– Low: keeps latencies low
– Sweet spot: DStream job time (scheduling + processing) is steady and less than the batch interval
• Checkpoint interval
– High: reduces checkpointing overhead
– Low: reduces the amount of data lost on failure
– Recommendation: 5-10x the sliding window interval
• Use DStream.repartition() to increase parallelism of processing
DStream jobs across cluster
• Set spark.streaming.unpersist=true to let the framework decide when to unpersist old batch RDDs
• Use CMS GC for consistent processing times
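The last three tuning points can be sketched as follows (1.x-era flag names; `dstream` is an assumed existing DStream, and the partition count is illustrative):

```scala
import org.apache.spark.SparkConf

// Flags for the tuning points above.
val conf = new SparkConf()
  .set("spark.streaming.unpersist", "true")  // framework unpersists old batches
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")  // CMS GC

// Spread each micro-batch over more cores before expensive stages.
val repartitioned = dstream.repartition(16)
```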
58. Approximation Overview
• Required for scaling
• Speed up analysis of large datasets
• Reduce size of working dataset
• Data is messy
• Collection of data is messy
• Exact isn’t always necessary
• “Approximate is the new Exact”
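One approximation that ships in the core RDD API is a HyperLogLog-based distinct count, a sketch of "approximate is the new exact" (the input path is a placeholder):

```scala
// Approximate distinct count via HyperLogLog, built into the RDD API.
// Trades a bounded relative error (5% here) for constant memory.
val userIds = sc.textFile("events.log")  // placeholder input
                .map(_.split("\t")(0))
val approxUniques = userIds.countApproxDistinct(relativeSD = 0.05)
```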
60. Approximations In Action
Figure: Memory Savings with Approximation Techniques
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
61. Spark Statistics Library
• Correlations
– Dependence between 2 random variables
– Pearson, Spearman
• Hypothesis Testing
– Measure of statistical significance
– Chi-squared test
• Stratified Sampling
– Sample separately from different sub-populations
– Bernoulli and Poisson sampling
– With and without replacement
• Random data generator
– Uniform, standard normal, and Poisson distribution
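A minimal sketch combining the random data generator and a Pearson correlation (MLlib 1.1-era API; `sc` is an assumed existing SparkContext):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.random.RandomRDDs

// Correlation between two RDD[Double] samples.
val x = RandomRDDs.normalRDD(sc, 10000L)     // standard normal generator
val y = x.map(v => 2.0 * v + 0.5)            // perfectly linear in x
val corr = Statistics.corr(x, y, "pearson")  // close to 1.0
```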
62. Summary
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
Oct 2014 MEAP Early Access
http://sparkinaction.com