Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture
1. Spark Streaming and Friends
Chris Fregly
Global Big Data Conference
Sept 2014
2. Who am I?
Former Netflix’er:
netflix.github.io
Spark Contributor:
github.com/apache/spark
Founder:
fluxcapacitor.com
Author:
effectivespark.com
sparkinaction.com
6. Agenda
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
7. Spark Overview (1/2)
• Berkeley AMPLab ~2009
• Part of Berkeley Data Analytics Stack (BDAS,
aka “badass”)
8. Spark Overview (2/2)
• Based on Microsoft Dryad paper ~2007
• Written in Scala
• Supports Java, Python, SQL, and R
• In-memory when possible, not required
• Improved efficiency over MapReduce
– 100x in-memory, 2-10x on-disk
• Compatible with Hadoop
– File formats, SerDes, and UDFs
12. Unified Benefits
• Advancements in higher-level libraries
pushed down into core and vice-versa
• Examples
– Spark Streaming: GC and memory
management improvements
– Spark GraphX: IndexedRDD for random,
hashed access within a partition versus
scanning entire partition
14. Resilient Distributed Dataset (RDD)
• Core Spark abstraction
• Represents partitions
across the cluster nodes
• Enables parallel processing
on data sets
• Partitions can be in-memory or
on-disk
• Immutable, recomputable,
fault tolerant
• Contains transformation lineage on data set
16. Spark API Overview
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python,
SQL, and R (mostly)
• Unified API across all libraries
• Operations = Transformations + Actions
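The transformation/action split above can be sketched with the core RDD API. A minimal local-mode example (the input path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Transformations build lineage lazily; nothing runs until an action.
val sc = new SparkContext(
  new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

val lines  = sc.textFile("access.log")              // RDD from storage (placeholder path)
val errors = lines.filter(_.contains("ERROR"))      // transformation: lazy
val byHost = errors.map(line => (line.split(" ")(0), 1))
                   .reduceByKey(_ + _)              // transformation: still lazy
val counts = byHost.collect()                       // action: triggers DAG execution
```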
24. Master High Availability
• Multiple Master Nodes
• ZooKeeper maintains current Master
• Existing applications and workers will be
notified of new Master election
• New applications and workers need to
explicitly specify current Master
• Alternatives (Not recommended)
– Local filesystem
– NFS Mount
26. Spark Streaming Overview
• Low latency, high throughput, fault-tolerance
(mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis,
Socket, File, etc.
• Graceful shutdown, in-flight message
draining
• Uses Spark Core, DAG Execution Model,
and Fault Tolerance
27. Spark Streaming Use Cases
• ETL on streaming data during ingestion
• Anomaly, malware, and fraud detection
• Operational dashboards
• Lambda architecture
– Unified batch and streaming
– e.g., different machine learning models for different time frames
• Predictive maintenance
– Sensors
• NLP analysis
– Twitter firehose
28. Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
• Fault tolerance using DStream/RDD lineage
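The canonical socket word count illustrates the micro-batch model; host and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 2-second micro-batch of input becomes one RDD inside the DStream.
val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

val lines  = ssc.socketTextStream("localhost", 9999)  // placeholder source
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)  // same operator names as the RDD API
counts.print()

ssc.start()
ssc.awaitTermination()
```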
30. Spark Streaming API Overview
• Rich, expressive API similar to core
• Operations
– Transformations
– Actions
• Window and State Operations
• Requires checkpointing to truncate long-running DStream lineage
• Register DStream as a Spark SQL table
for querying!
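A sketch of a window operation plus SQL registration, continuing a word-count DStream (`ssc`, a `(word, count)` DStream named `counts`, and a `sqlContext` are assumed to exist; the checkpoint path is a placeholder, and the table API is the Spark 1.x-era one):

```scala
import org.apache.spark.streaming.Seconds

// Stateful/window operations require a checkpoint directory so the
// growing DStream lineage can be periodically truncated.
ssc.checkpoint("checkpoints")

// Counts over the last 60 seconds, recomputed every 10 seconds.
val windowed = counts.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

// Register each micro-batch as a Spark SQL table and query it.
windowed.foreachRDD { rdd =>
  sqlContext.createSchemaRDD(rdd).registerTempTable("window_counts")
  sqlContext.sql(
    "SELECT * FROM window_counts ORDER BY _2 DESC LIMIT 10").collect()
}
```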
40. Spark Streaming + Kinesis Architecture
[Figure: Multiple Kinesis Producers write to a Kinesis Stream split into Shards 1-3. A Spark Streaming application consumes the stream through Kinesis Receiver DStreams; each receiver embeds the Kinesis Client Library (KCL) and runs one Record Processor Thread per shard it owns.]
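Wiring up a KCL-backed receiver can be sketched with the `spark-streaming-kinesis-asl` helper from the Spark 1.1 era (`ssc` is assumed to exist; the stream name is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// One receiver per call; create several and union() them to consume
// multiple shards in parallel.
val stream = KinesisUtils.createStream(
  ssc,
  "myKinesisStream",                          // stream name (placeholder)
  "https://kinesis.us-east-1.amazonaws.com",  // region endpoint
  Seconds(2),                                 // KCL checkpoint interval (DynamoDB)
  InitialPositionInStream.LATEST,
  StorageLevel.MEMORY_AND_DISK_2)             // replicate to 2 nodes on ingest
```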
41. Throughput and Pricing
[Figure: A Kinesis Producer writes to a Kinesis Stream shard; a Spark Kinesis Receiver in the Spark Streaming application consumes it with < 10 second delay.]
• Ingest: 1 MB/sec per shard, up to 1000 PUTs/sec (50 KB max per PUT)
• Egress: 2 MB/sec per shard
• Shard cost: $0.36 per day per shard
• PUT cost: $2.50 per day per shard (at the full PUT rate)
• Network transfer cost: free within region!
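The per-shard figures above work out as follows (a worked example using the Sept-2014 prices quoted on the slide):

```scala
// Cost of one fully utilized shard at the Sept-2014 prices above.
val shardCostPerDay = 0.36                  // $ per shard-day
val putCostPerDay   = 2.50                  // $ per shard-day at the max PUT rate
val putsPerDay      = 1000L * 60 * 60 * 24  // 86,400,000 PUTs/day per shard
val totalPerDay     = shardCostPerDay + putCostPerDay  // $2.86 per shard-day
```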
45. Streaming Receiver Failure
• Use a backup receiver
• Use multiple receivers pulling from multiple
shards
– Use a checkpoint-enabled, sharded streaming source (e.g., Kafka or Kinesis)
• Data is replicated to 2 nodes immediately
upon ingestion
• Possible data loss
• Possible at-least-once delivery (duplicate records)
• Use buffered, replayable sources (e.g., Kafka or Kinesis)
46. Streaming Driver Failure
• Use a backup Driver
– Use DStream metadata checkpoint info to
recover
• Single point of failure – interrupts stream
processing
• Streaming Driver is a long-running Spark
application
– Schedules long-running stream receivers
• State and Window RDD checkpoints help
avoid data loss (mostly)
48. Types of Checkpoints
Spark
1. Spark checkpointing of StreamingContext
DStreams and metadata
2. Lineage of state and window DStream
operations
Kinesis
3. Kinesis Client Library (KCL) checkpoints
current position within shard
– Checkpoint info is stored in DynamoDB per
Kinesis application keyed by shard
52. Tuning
• Batch interval
– High: reduces the overhead of submitting new tasks for each batch
– Low: keeps latencies low
– Sweet spot: DStream job time (scheduling + processing) is steady and less than the batch interval
• Checkpoint interval
– High: reduces checkpointing overhead
– Low: reduces the amount of data lost on failure
– Recommendation: 5-10x the sliding window interval
• Use DStream.repartition() to increase parallelism of processing
DStream jobs across cluster
• Set spark.streaming.unpersist=true to let the framework decide when to unpersist old batch RDDs
• Use CMS GC for consistent processing times
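The last three tuning points can be sketched as follows (1.x-era flag names; `dstream` is an assumed existing DStream, and the partition count is illustrative):

```scala
import org.apache.spark.SparkConf

// Flags for the tuning points above.
val conf = new SparkConf()
  .set("spark.streaming.unpersist", "true")  // framework unpersists old batches
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")  // CMS GC

// Spread each micro-batch over more cores before expensive stages.
val repartitioned = dstream.repartition(16)
```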
58. Approximation Overview
• Required for scaling
• Speed up analysis of large datasets
• Reduce size of working dataset
• Data is messy
• Collection of data is messy
• Exact isn’t always necessary
• “Approximate is the new Exact”
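One approximation that ships in the core RDD API is a HyperLogLog-based distinct count, a sketch of "approximate is the new exact" (the input path is a placeholder):

```scala
// Approximate distinct count via HyperLogLog, built into the RDD API.
// Trades a bounded relative error (5% here) for constant memory.
val userIds = sc.textFile("events.log")  // placeholder input
                .map(_.split("\t")(0))
val approxUniques = userIds.countApproxDistinct(relativeSD = 0.05)
```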
60. Approximations In Action
Figure: Memory Savings with Approximation Techniques
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
61. Spark Statistics Library
• Correlations
– Dependence between 2 random variables
– Pearson, Spearman
• Hypothesis Testing
– Measure of statistical significance
– Chi-squared test
• Stratified Sampling
– Sample separately from different sub-populations
– Bernoulli and Poisson sampling
– With and without replacement
• Random data generator
– Uniform, standard normal, and Poisson distribution
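A minimal sketch combining the random data generator and a Pearson correlation (MLlib 1.1-era API; `sc` is an assumed existing SparkContext):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.random.RandomRDDs

// Correlation between two RDD[Double] samples.
val x = RandomRDDs.normalRDD(sc, 10000L)     // standard normal generator
val y = x.map(v => 2.0 * v + 0.5)            // perfectly linear in x
val corr = Statistics.corr(x, y, "pearson")  // close to 1.0
```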
62. Summary
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
Oct 2014 MEAP Early Access
http://sparkinaction.com