East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Chris Fregly
Chris FreglyAI and Machine Learning @ AWS, O'Reilly Author @ Data Science on AWS, Founder @ PipelineAI, Formerly Databricks, Netflix,
Spark and Friends: 
Spark Streaming, 
Machine Learning, Graph Processing, 
Lambda Architecture, Approximations 
Chris Fregly 
East Bay Java User Group 
Oct 2014 
Kinesis 
Streaming
Who am I? 
Former Netflix’er 
(netflix.github.io) 
Spark Contributor 
(github.com/apache/spark) 
Founder 
(fluxcapacitor.com) 
Author 
(effectivespark.com, 
sparkinaction.com)
Spark Streaming-Kinesis Jira
Not-so-Quick Poll 
• Spark, Spark Streaming? 
• Hadoop, Hive, Pig? 
• Parquet, Avro, RCFile, ORCFile? 
• EMR, DynamoDB, Redshift? 
• Flume, Kafka, Kinesis, Storm? 
• Lambda Architecture? 
• PageRank, Collaborative Filtering, Recommendations? 
• Probabilistic Data Structs, Bloom Filters, HyperLogLog?
Quick Poll 
Raiders or 49’ers?
“Streaming” 
Kinesis 
Streaming 
Video 
Streaming 
Piping 
Big Data 
Streaming
Agenda 
• Spark, Spark Streaming 
• Use Cases 
• API and Libraries 
• Machine Learning 
• Graph Processing 
• Execution Model 
• Fault Tolerance 
• Cluster Deployment 
• Monitoring 
• Scaling and Tuning 
• Lambda Architecture 
• Probabilistic Data Structures 
• Approximations
Timeline of Spark Evolution
Spark and Berkeley AMP Lab 
Berkeley Data Analytics Stack (BDAS)
Spark Overview 
• Based on 2007 Microsoft Dryad paper 
• Written in Scala 
• Supports Java, Python, SQL, and R 
• Data fits in memory when possible, but not 
required 
• Improved efficiency over MapReduce 
– 100x in-memory, 2-10x on-disk 
• Compatible with Hadoop 
– File formats, SerDes, and UDFs
Spark Use Cases 
• Ad hoc, exploratory, interactive analytics 
• Real-time + Batch Analytics 
– Lambda Architecture 
• Real-time Machine Learning 
• Real-time Graph Processing 
• Approximate, Time-bound Queries
Explosion of Specialized Systems
Unified High-level Spark Libraries 
• Spark SQL (Data Processing) 
• Spark Streaming (Streaming) 
• MLlib (Machine Learning) 
• GraphX (Graph Processing) 
• BlinkDB (Approximate Queries) 
• Statistics (Correlations, Sampling, etc) 
• Others 
– Shark (Hive on Spark) 
– Spork (Pig on Spark)
Benefits of Unified Libraries 
• Advancements in higher-level libraries are 
pushed down into core and vice-versa 
• Examples 
– Spark Streaming 
• GC and memory management improvements 
– Spark GraphX 
• IndexedRDD for random, hashed-based access within 
a partition versus scanning entire partition 
– Spark Core 
• Sort-based Shuffle
Resilient Distributed Dataset 
(RDD)
RDD Overview 
• Core Spark abstraction 
• Represents partitions 
across the cluster nodes 
• Enables parallel processing 
on data sets 
• Partitions can be in-memory 
or on-disk 
• Immutable, recomputable, 
fault tolerant 
• Contains transformation history (“lineage”) for 
whole data set
Types of RDDs 
• Spark Out-of-the-Box 
– HadoopRDD 
– FilteredRDD 
– MappedRDD 
– PairRDD 
– ShuffledRDD 
– UnionRDD 
– DoubleRDD 
– JdbcRDD 
– JsonRDD 
– SchemaRDD 
– VertexRDD 
– EdgeRDD 
• External 
– CassandraRDD (DataStax) 
– GeoRDD (Esri) 
– EsSpark (ElasticSearch)
RDD Partitions
RDD Lineage Examples
Demo! 
• Load user data 
• Load gender data 
• Join user data 
with gender data 
• Analyze lineage
Join Optimizations 
• When joining large dataset with small dataset (reference 
data) 
• Broadcast small dataset to each node/partition of large 
dataset (one broadcast per node)
Spark API
Spark API Overview 
• Richer, more expressive than MapReduce 
• Native support for Java, Scala, Python, 
SQL, and R (mostly) 
• Unified API across all libraries 
• Operations 
– Transformations (lazy evaluation) 
– Actions (execute transformations)
Transformations
Actions
Spark Execution Model
Job Scheduling 
• Job 
– Contains many stages 
• Contains many tasks 
• FIFO 
– Long-running jobs may starve resources for other 
jobs 
• Fair 
– Round-robin to prevent resource starvation 
• Fair Scheduler Pools 
– High-priority pool for important jobs 
– Separate user pools 
– FIFO within the pool 
– Modeled after Hadoop Fair Scheduler
Spark Execution Model Overview 
• Parallel, distributed 
• DAG-based 
• Lazy evaluation 
• Allows optimizations 
– Reduce disk I/O 
– Reduce shuffle I/O 
– Single pass through dataset 
– Parallel execution 
– Task pipelining 
• Data locality and rack awareness 
• Worker node fault tolerance using RDD 
lineage graphs per partition
Execution Optimizations
Task Pipelining
Spark Cluster Deployment
Types of Cluster Deployments 
• Spark Standalone (default) 
• YARN 
– Allows hierarchies of resources 
– Kerberos integration 
– Multiple workloads from disparate execution frameworks 
• Hive, Pig, Spark, MapReduce, Cascading, etc 
• Mesos 
– Coarse-grained 
• Single, long-running Mesos tasks runs Spark mini tasks 
– Fine-grained 
• New Mesos task for each Spark task 
• Higher overhead 
• Not good for long-running Spark jobs like Spark Streaming
Spark Standalone Deployment
Spark YARN Deployment
Master High Availability 
• Multiple Master Nodes 
• ZooKeeper maintains current Master 
• Existing applications and workers will be 
notified of new Master election 
• New applications and workers need to 
explicitly specify current Master 
• Alternatives (Not recommended) 
– Local filesystem 
– NFS Mount
Spark After Dark 
(sparkafterdark.com)
Spark After Dark ~ Tinder
Mutual Match!
Goal: Increase Matches (1/2) 
• Top Influencers 
– PageRank 
– Most desirable people 
• People You May Know 
– Shortest Path 
– Facebook-enabled 
• Recommendations 
– Alternating Least 
Squares (ALS)
Goal: Increase Matches (2/2) 
• Cluster on Interests, 
Height, Religion, etc 
– K-Means 
– Nearest Neighbor 
• Textual Analysis 
of Profile Text 
– TF/IDF 
– N-gram 
– LDA Topic Extraction
Demo!
Spark Streaming
Spark Streaming Overview 
• Low latency, high throughput, fault-tolerance 
(mostly) 
• Long-running Spark application 
• Supports Flume, Kafka, Twitter, Kinesis, 
Socket, File, etc. 
• Graceful shutdown, in-flight message draining 
• Uses Spark Core, DAG Execution Model, and 
Fault Tolerance 
• Submits micro-batch jobs to the cluster
Spark Streaming Overview
Spark Streaming Use Cases 
• ETL and enrichment 
of streaming data 
on injest 
• Operational 
dashboards 
• Lambda 
Architecture
Discretized Stream (DStream) 
• Core Spark Streaming abstraction 
• Micro-batches of RDDs 
• Operations similar to RDD – operates on underlying RDDs
Spark Streaming API
Spark Streaming API Overview 
• Rich, expressive API similar to core 
• Operations 
– Transformations (lazy) 
– Actions (execute transformations) 
• Window and State Operations 
• Requires checkpointing to snip long-running 
DStream lineage 
• Register DStream as a Spark SQL table 
for querying?! Wow.
DStream Transformations
DStream Actions
Window and State DStream Operations
DStream Window and State Example
Spark Streaming Cluster 
Deployment
Spark Streaming Cluster Deployment
Adding Receivers
Adding Workers
Spark Streaming 
+ 
Kinesis
Spark Streaming + Kinesis Architecture
Throughput and Pricing
Demo! 
Kinesis 
Streaming 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/… 
Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala 
Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
Spark Streaming 
Fault Tolerance
Characteristics of Sources 
• Buffered 
– Flume, Kafka, Kinesis 
– Allows replay and back pressure 
• Batched 
– Flume, Kafka, Kinesis 
– Improves throughput at expense of duplication on 
failure 
• Checkpointed 
– Kafka, Kinesis 
– Allows replay from specific checkpoint
Message Delivery Guarantees 
• Exactly once [1] 
– No loss 
– No redeliver 
– Perfect delivery 
– Incurs higher latency for transactional semantics 
– Spark default per batch using DStream lineage 
– Degrades to less guarantees depending on source 
• At least once [1..n] 
– No loss 
– Possible redeliver 
• At most once [0,1] 
– Possible loss 
– No redeliver 
– *Best configuration if some data loss is acceptable 
• Ordered 
– Per partition: Kafka, Kinesis 
– Global across all partitions: Hard to scale
Types of Checkpoints 
Spark 
1. Spark checkpointing of StreamingContext 
DStreams and metadata 
2. Lineage of state and window DStream 
operations 
Kinesis 
3. Kinesis Client Library (KCL) checkpoints 
current position within shard 
– Checkpoint info is stored in DynamoDB per 
Kinesis application keyed by shard
Fault Tolerance 
• Points of Failure 
– Receiver 
– Driver 
– Worker/Processor 
• Possible Solutions 
– Use HDFS File Source for durability 
– Data Replication 
– Secondary/Backup Nodes 
– Checkpoints 
• Stream, Window, and State info
Streaming Receiver Failure 
• Use a backup receiver 
• Use multiple receivers pulling from multiple 
shards 
– Use checkpoint-enabled, sharded streaming 
source (ie. Kafka and Kinesis) 
• Data is replicated to 2 nodes immediately 
upon ingestion 
– Will spill to disk if doesn’t fit in memory 
• Possible loss of most-recent batch 
• Possible at-least once delivery of batch 
• Use buffered sources for replay 
– Kafka and Kinesis
Streaming Driver Failure 
• Use a backup Driver 
– Use DStream metadata checkpoint info to 
recover 
• Single point of failure – interrupts stream 
processing 
• Streaming Driver is a long-running Spark 
application 
– Schedules long-running stream receivers 
• State and Window RDD checkpoints to 
HDFS to help avoid data loss (mostly)
Stream Worker/Processor Failure 
• No problem! 
• DStream RDD partitions will be recalculated from lineage 
• Causes blip in processing during node failover
Spark Streaming 
Monitoring and Tuning
Monitoring 
• Monitor driver, receiver, worker nodes, and 
streams 
• Alert upon failure or unusually high latency 
• Spark Web UI 
– Streaming tab 
• Ganglia, CloudWatch 
• StreamingListener callback
Spark Web UI
Tuning 
• Batch interval 
– High: reduce overhead of submitting new tasks for each batch 
– Low: keeps latencies low 
– Sweet spot: DStream job time (scheduling + processing) is 
steady and less than batch interval 
• Checkpoint interval 
– High: reduce load on checkpoint overhead 
– Low: reduce amount of data loss on failure 
– Recommendation: 5-10x sliding window interval 
• Use DStream.repartition() to increase parallelism of processing 
DStream jobs across cluster 
• Use spark.streaming.unpersist=true to let the Streaming Framework 
figure out when to unpersist 
• Use CMS GC for consistent processing times
Lambda Architecture
Lambda Architecture Overview 
• Batch Layer 
– Immutable, 
Batch read, 
Append-only write 
– Source of truth 
– ie. HDFS 
• Speed Layer 
– Mutable, 
Random read/write 
– Most complex 
– Recent data only 
– ie. Cassandra 
• Serving Layer 
– Immutable, 
Random read, 
Batch write 
– ie. ElephantDB
Spark + AWS + Lambda
Spark + Lambda + GraphX + MLlib
Demo! 
• Load JSON 
• Convert to Parquet file 
• Save Parquet file to disk 
• Query Parquet file directly
Approximations
Approximation Overview 
• Required for scaling 
• Speed up analysis of large datasets 
• Reduce size of working dataset 
• Data is messy 
• Collection of data is messy 
• Exact isn’t always necessary 
• “Approximate is the new Exact”
Some Approximation Methods 
• Approximate time-bound queries 
– BlinkDB 
• Bernouilli and Poisson Sampling 
– RDD: sample(), takeSample() 
• HyperLogLog 
PairRDD: countApproxDistinctByKey() 
• Count-min Sketch 
• Spark Streaming and Twitter Algebird 
• Bloom Filters 
– Everywhere!
Approximations In Action 
Figure: Memory Savings with Approximation Techniques 
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
Spark Statistics Library 
• Correlations 
– Dependence between 2 random variables 
– Pearson, Spearman 
• Hypothesis Testing 
– Measure of statistical significance 
– Chi-squared test 
• Stratified Sampling 
– Sample separately from different sub-populations 
– Bernoulli and Poisson sampling 
– With and without replacement 
• Random data generator 
– Uniform, standard normal, and Poisson distribution
Summary 
• Spark, Spark Streaming Overview 
• Use Cases 
• API and Libraries 
• Machine Learning 
• Graph Processing 
• Execution Model 
• Fault Tolerance 
• Cluster Deployment 
• Monitoring 
• Scaling and Tuning 
• Lambda Architecture 
• Probabilistic Data Structures 
• Approximations 
http://effectivespark.com http://sparkinaction.com 
Thanks!! 
Chris Fregly 
@cfregly
1 of 83

Recommended

Spark after Dark by Chris Fregly of Databricks by
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
4K views116 slides
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio... by
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
3.6K views62 slides
Spark Summit 2014: Spark Job Server Talk by
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
4.4K views29 slides
Productionizing Spark and the REST Job Server- Evan Chan by
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
13K views72 slides
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass... by
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
4.8K views42 slides
What no one tells you about writing a streaming app by
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
1.3K views79 slides

More Related Content

What's hot

Top 5 mistakes when writing Streaming applications by
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
2K views79 slides
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin by
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSpark Summit
2K views35 slides
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
83.3K views77 slides
Solr + Hadoop = Big Data Search by
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
5.7K views33 slides
Spark Internals Training | Apache Spark | Spark | Anika Technologies by
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
167 views13 slides
Real time Analytics with Apache Kafka and Apache Spark by
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
84K views54 slides

What's hot(20)

Top 5 mistakes when writing Streaming applications by hadooparchbook
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook2K views
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin by Spark Summit
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Spark Summit2K views
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by Helena Edelson
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson83.3K views
Solr + Hadoop = Big Data Search by Mark Miller
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
Mark Miller5.7K views
Spark Internals Training | Apache Spark | Spark | Anika Technologies by Anand Narayanan
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Anand Narayanan167 views
Real time Analytics with Apache Kafka and Apache Spark by Rahul Jain
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain84K views
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark by Evan Chan
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan5.8K views
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark by Evan Chan
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Evan Chan11.1K views
An introduction into Spark ML plus how to go beyond when you get stuck by Data Con LA
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
Data Con LA1.8K views
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi by Databricks
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Databricks3.7K views
Breakthrough OLAP performance with Cassandra and Spark by Evan Chan
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan8.8K views
Streaming architecture patterns by hadooparchbook
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
hadooparchbook5.4K views
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft by Chester Chen
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen330 views
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z... by Databricks
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Databricks2.4K views
Introduction to real time big data with Apache Spark by Taras Matyashovsky
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky2.6K views
Building Scalable Data Pipelines - 2016 DataPalooza Seattle by Evan Chan
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan5.7K views
Hadoop application architectures - using Customer 360 as an example by hadooparchbook
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook6.1K views
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases by Shivji Kumar Jha
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Shivji Kumar Jha321 views
Data Science meets Software Development by Alexis Seigneurin
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
Alexis Seigneurin1.1K views
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data by Lightbend
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Lightbend6.5K views

Viewers also liked

Summarization and opinion detection in product reviews by
Summarization and opinion detection in product reviewsSummarization and opinion detection in product reviews
Summarization and opinion detection in product reviewspapanaboinasuman
1.1K views26 slides
Hadoop bootcamp getting started by
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting startedJWORKS powered by Ordina
485 views71 slides
On Centralizing Logs by
On Centralizing LogsOn Centralizing Logs
On Centralizing LogsSematext Group, Inc.
15.6K views39 slides
Introduction to spark by
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
3.5K views70 slides
Tuning and Debugging in Apache Spark by
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
12.3K views50 slides
Resilient Distributed DataSets - Apache SPARK by
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
6.8K views16 slides

Viewers also liked(9)

Summarization and opinion detection in product reviews by papanaboinasuman
Summarization and opinion detection in product reviewsSummarization and opinion detection in product reviews
Summarization and opinion detection in product reviews
papanaboinasuman1.1K views
Introduction to spark by Duyhai Doan
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan3.5K views
Tuning and Debugging in Apache Spark by Databricks
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks12.3K views
Resilient Distributed DataSets - Apache SPARK by Taposh Roy
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy6.8K views
Apache Spark RDDs by Dean Chen
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen20.5K views
Apache Spark in Depth: Core Concepts, Architecture & Internals by Anton Kirillov
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov9.6K views
Spark 의 핵심은 무엇인가? RDD! (RDD paper review) by Yongho Ha
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Yongho Ha87.2K views

Similar to East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014 by
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
17.7K views34 slides
Spark After Dark - LA Apache Spark Users Group - Feb 2015 by
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Chris Fregly
5.1K views116 slides
Apache Spark Components by
Apache Spark ComponentsApache Spark Components
Apache Spark ComponentsGirish Khanzode
3.3K views91 slides
Intro to Apache Spark by CTO of Twingo by
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
4.1K views44 slides
Productionizing Spark and the Spark Job Server by
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
16.8K views72 slides
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov... by
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
3.5K views79 slides

Similar to East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning(20)

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014 by Chris Fregly
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Chris Fregly17.7K views
Spark After Dark - LA Apache Spark Users Group - Feb 2015 by Chris Fregly
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Chris Fregly5.1K views
Intro to Apache Spark by CTO of Twingo by MapR Technologies
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies4.1K views
Productionizing Spark and the Spark Job Server by Evan Chan
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan16.8K views
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov... by Databricks
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
Databricks3.5K views
Tuning and Monitoring Deep Learning on Apache Spark by Databricks
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks3.3K views
Introduction to apache spark by UserReport
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
UserReport1.5K views
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 by Mac Moore
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Mac Moore3.3K views
Migrating ETL Workflow to Apache Spark at Scale in Pinterest by Databricks
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Databricks326 views
What No One Tells You About Writing a Streaming App: Spark Summit East talk b... by Spark Summit
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit9.3K views
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example by confluent
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
confluent2.7K views
Spy hard, challenges of 100G deep packet inspection on x86 platform by Redge Technologies
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
Redge Technologies4.3K views
Ingesting hdfs intosolrusingsparktrimmed by whoschek
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
whoschek1.6K views
Apache Spark: The Next Gen toolset for Big Data Processing by prajods
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
prajods3.3K views

More from Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data by
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
333 views79 slides
Pandas on AWS - Let me count the ways.pdf by
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
191 views32 slides
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated by
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
1.9K views15 slides
Amazon reInvent 2020 Recap: AI and Machine Learning by
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
1.2K views25 slides
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... by
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
900 views39 slides
Quantum Computing with Amazon Braket by
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
1K views35 slides

More from Chris Fregly(20)

AWS reInvent 2022 reCap AI/ML and Data by Chris Fregly
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
Chris Fregly333 views
Pandas on AWS - Let me count the ways.pdf by Chris Fregly
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
Chris Fregly191 views
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated by Chris Fregly
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Chris Fregly1.9K views
Amazon reInvent 2020 Recap: AI and Machine Learning by Chris Fregly
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
Chris Fregly1.2K views
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... by Chris Fregly
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Chris Fregly900 views
Quantum Computing with Amazon Braket by Chris Fregly
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
Chris Fregly1K views
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person by Chris Fregly
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
Chris Fregly2.6K views
AWS Re:Invent 2019 Re:Cap by Chris Fregly
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
Chris Fregly2.1K views
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo... by Chris Fregly
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
Chris Fregly3.9K views
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -... by Chris Fregly
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Chris Fregly1.2K views
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ... by Chris Fregly
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Chris Fregly3.7K views
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T... by Chris Fregly
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Chris Fregly597 views
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -... by Chris Fregly
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
Chris Fregly1.1K views
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer... by Chris Fregly
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
Chris Fregly607 views
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ... by Chris Fregly
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Chris Fregly5.3K views
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to... by Chris Fregly
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
Chris Fregly2.5K views
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern... by Chris Fregly
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Chris Fregly963 views
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -... by Chris Fregly
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly3.9K views
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +... by Chris Fregly
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
Chris Fregly1.4K views
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S... by Chris Fregly
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
Chris Fregly2.5K views

East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning

  • 1. Spark and Friends: Spark Streaming, Machine Learning, Graph Processing, Lambda Architecture, Approximations Chris Fregly East Bay Java User Group Oct 2014 Kinesis Streaming
  • 2. Who am I? Former Netflix’er (netflix.github.io) Spark Contributor (github.com/apache/spark) Founder (fluxcapacitor.com) Author (effectivespark.com, sparkinaction.com)
  • 4. Not-so-Quick Poll • Spark, Spark Streaming? • Hadoop, Hive, Pig? • Parquet, Avro, RCFile, ORCFile? • EMR, DynamoDB, Redshift? • Flume, Kafka, Kinesis, Storm? • Lambda Architecture? • PageRank, Collaborative Filtering, Recommendations? • Probabilistic Data Structs, Bloom Filters, HyperLogLog?
  • 5. Quick Poll Raiders or 49’ers?
  • 6. “Streaming” Kinesis Streaming Video Streaming Piping Big Data Streaming
  • 7. Agenda • Spark, Spark Streaming • Use Cases • API and Libraries • Machine Learning • Graph Processing • Execution Model • Fault Tolerance • Cluster Deployment • Monitoring • Scaling and Tuning • Lambda Architecture • Probabilistic Data Structures • Approximations
  • 8. Timeline of Spark Evolution
  • 9. Spark and Berkeley AMP Lab Berkeley Data Analytics Stack (BDAS)
  • 10. Spark Overview • Based on 2007 Microsoft Dryad paper • Written in Scala • Supports Java, Python, SQL, and R • Data fits in memory when possible, but not required • Improved efficiency over MapReduce – 100x in-memory, 2-10x on-disk • Compatible with Hadoop – File formats, SerDes, and UDFs
  • 11. Spark Use Cases • Ad hoc, exploratory, interactive analytics • Real-time + Batch Analytics – Lambda Architecture • Real-time Machine Learning • Real-time Graph Processing • Approximate, Time-bound Queries
  • 13. Unified High-level Spark Libraries • Spark SQL (Data Processing) • Spark Streaming (Streaming) • MLlib (Machine Learning) • GraphX (Graph Processing) • BlinkDB (Approximate Queries) • Statistics (Correlations, Sampling, etc) • Others – Shark (Hive on Spark) – Spork (Pig on Spark)
  • 14. Benefits of Unified Libraries • Advancements in higher-level libraries are pushed down into core and vice-versa • Examples – Spark Streaming • GC and memory management improvements – Spark GraphX • IndexedRDD for random, hashed-based access within a partition versus scanning entire partition – Spark Core • Sort-based Shuffle
  • 16. RDD Overview • Core Spark abstraction • Represents partitions across the cluster nodes • Enables parallel processing on data sets • Partitions can be in-memory or on-disk • Immutable, recomputable, fault tolerant • Contains transformation history (“lineage”) for whole data set
  • 17. Types of RDDs • Spark Out-of-the-Box – HadoopRDD – FilteredRDD – MappedRDD – PairRDD – ShuffledRDD – UnionRDD – DoubleRDD – JdbcRDD – JsonRDD – SchemaRDD – VertexRDD – EdgeRDD • External – CassandraRDD (DataStax) – GeoRDD (Esri) – EsSpark (ElasticSearch)
  • 20. Demo! • Load user data • Load gender data • Join user data with gender data • Analyze lineage
  • 21. Join Optimizations • When joining large dataset with small dataset (reference data) • Broadcast small dataset to each node/partition of large dataset (one broadcast per node)
  • 23. Spark API Overview • Richer, more expressive than MapReduce • Native support for Java, Scala, Python, SQL, and R (mostly) • Unified API across all libraries • Operations – Transformations (lazy evaluation) – Actions (execute transformations)
  • 27. Job Scheduling • Job – Contains many stages • Contains many tasks • FIFO – Long-running jobs may starve resources for other jobs • Fair – Round-robin to prevent resource starvation • Fair Scheduler Pools – High-priority pool for important jobs – Separate user pools – FIFO within the pool – Modeled after Hadoop Fair Scheduler
  • 28. Spark Execution Model Overview • Parallel, distributed • DAG-based • Lazy evaluation • Allows optimizations – Reduce disk I/O – Reduce shuffle I/O – Single pass through dataset – Parallel execution – Task pipelining • Data locality and rack awareness • Worker node fault tolerance using RDD lineage graphs per partition
  • 32. Types of Cluster Deployments • Spark Standalone (default) • YARN – Allows hierarchies of resources – Kerberos integration – Multiple workloads from disparate execution frameworks • Hive, Pig, Spark, MapReduce, Cascading, etc • Mesos – Coarse-grained • Single, long-running Mesos tasks runs Spark mini tasks – Fine-grained • New Mesos task for each Spark task • Higher overhead • Not good for long-running Spark jobs like Spark Streaming
  • 35. Master High Availability • Multiple Master Nodes • ZooKeeper maintains current Master • Existing applications and workers will be notified of new Master election • New applications and workers need to explicitly specify current Master • Alternatives (Not recommended) – Local filesystem – NFS Mount
  • 36. Spark After Dark (sparkafterdark.com)
  • 37. Spark After Dark ~ Tinder
  • 39. Goal: Increase Matches (1/2) • Top Influencers – PageRank – Most desirable people • People You May Know – Shortest Path – Facebook-enabled • Recommendations – Alternating Least Squares (ALS)
  • 40. Goal: Increase Matches (2/2) • Cluster on Interests, Height, Religion, etc – K-Means – Nearest Neighbor • Textual Analysis of Profile Text – TF/IDF – N-gram – LDA Topic Extraction
  • 41. Demo!
  • 43. Spark Streaming Overview • Low latency, high throughput, fault-tolerance (mostly) • Long-running Spark application • Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc. • Graceful shutdown, in-flight message draining • Uses Spark Core, DAG Execution Model, and Fault Tolerance • Submits micro-batch jobs to the cluster
  • 45. Spark Streaming Use Cases • ETL and enrichment of streaming data on injest • Operational dashboards • Lambda Architecture
  • 46. Discretized Stream (DStream) • Core Spark Streaming abstraction • Micro-batches of RDDs • Operations similar to RDD – operates on underlying RDDs
  • 48. Spark Streaming API Overview • Rich, expressive API similar to core • Operations – Transformations (lazy) – Actions (execute transformations) • Window and State Operations • Requires checkpointing to snip long-running DStream lineage • Register DStream as a Spark SQL table for querying?! Wow.
  • 51. Window and State DStream Operations
  • 52. DStream Window and State Example
  • 57. Spark Streaming + Kinesis
  • 58. Spark Streaming + Kinesis Architecture
  • 60. Demo! Kinesis Streaming https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/… Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
  • 62. Characteristics of Sources • Buffered – Flume, Kafka, Kinesis – Allows replay and back pressure • Batched – Flume, Kafka, Kinesis – Improves throughput at expense of duplication on failure • Checkpointed – Kafka, Kinesis – Allows replay from specific checkpoint
  • 63. Message Delivery Guarantees • Exactly once [1] – No loss – No redeliver – Perfect delivery – Incurs higher latency for transactional semantics – Spark default per batch using DStream lineage – Degrades to less guarantees depending on source • At least once [1..n] – No loss – Possible redeliver • At most once [0,1] – Possible loss – No redeliver – *Best configuration if some data loss is acceptable • Ordered – Per partition: Kafka, Kinesis – Global across all partitions: Hard to scale
  • 64. Types of Checkpoints Spark 1. Spark checkpointing of StreamingContext DStreams and metadata 2. Lineage of state and window DStream operations Kinesis 3. Kinesis Client Library (KCL) checkpoints current position within shard – Checkpoint info is stored in DynamoDB per Kinesis application keyed by shard
  • 65. Fault Tolerance • Points of Failure – Receiver – Driver – Worker/Processor • Possible Solutions – Use HDFS File Source for durability – Data Replication – Secondary/Backup Nodes – Checkpoints • Stream, Window, and State info
  • 66. Streaming Receiver Failure • Use a backup receiver • Use multiple receivers pulling from multiple shards – Use checkpoint-enabled, sharded streaming source (ie. Kafka and Kinesis) • Data is replicated to 2 nodes immediately upon ingestion – Will spill to disk if doesn’t fit in memory • Possible loss of most-recent batch • Possible at-least once delivery of batch • Use buffered sources for replay – Kafka and Kinesis
  • 67. Streaming Driver Failure • Use a backup Driver – Use DStream metadata checkpoint info to recover • Single point of failure – interrupts stream processing • Streaming Driver is a long-running Spark application – Schedules long-running stream receivers • State and Window RDD checkpoints to HDFS to help avoid data loss (mostly)
  • 68. Stream Worker/Processor Failure • No problem! • DStream RDD partitions will be recalculated from lineage • Causes blip in processing during node failover
  • 70. Monitoring • Monitor driver, receiver, worker nodes, and streams • Alert upon failure or unusually high latency • Spark Web UI – Streaming tab • Ganglia, CloudWatch • StreamingListener callback
  • 72. Tuning • Batch interval – High: reduce overhead of submitting new tasks for each batch – Low: keeps latencies low – Sweet spot: DStream job time (scheduling + processing) is steady and less than batch interval • Checkpoint interval – High: reduce load on checkpoint overhead – Low: reduce amount of data loss on failure – Recommendation: 5-10x sliding window interval • Use DStream.repartition() to increase parallelism of processing DStream jobs across cluster • Use spark.streaming.unpersist=true to let the Streaming Framework figure out when to unpersist • Use CMS GC for consistent processing times
  • 74. Lambda Architecture Overview • Batch Layer – Immutable, Batch read, Append-only write – Source of truth – ie. HDFS • Speed Layer – Mutable, Random read/write – Most complex – Recent data only – ie. Cassandra • Serving Layer – Immutable, Random read, Batch write – ie. ElephantDB
  • 75. Spark + AWS + Lambda
  • 76. Spark + Lambda + GraphX + MLlib
  • 77. Demo! • Load JSON • Convert to Parquet file • Save Parquet file to disk • Query Parquet file directly
  • 79. Approximation Overview • Required for scaling • Speed up analysis of large datasets • Reduce size of working dataset • Data is messy • Collection of data is messy • Exact isn’t always necessary • “Approximate is the new Exact”
  • 80. Some Approximation Methods • Approximate time-bound queries – BlinkDB • Bernouilli and Poisson Sampling – RDD: sample(), takeSample() • HyperLogLog PairRDD: countApproxDistinctByKey() • Count-min Sketch • Spark Streaming and Twitter Algebird • Bloom Filters – Everywhere!
  • 81. Approximations In Action Figure: Memory Savings with Approximation Techniques (http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
  • 82. Spark Statistics Library • Correlations – Dependence between 2 random variables – Pearson, Spearman • Hypothesis Testing – Measure of statistical significance – Chi-squared test • Stratified Sampling – Sample separately from different sub-populations – Bernoulli and Poisson sampling – With and without replacement • Random data generator – Uniform, standard normal, and Poisson distribution
  • 83. Summary • Spark, Spark Streaming Overview • Use Cases • API and Libraries • Machine Learning • Graph Processing • Execution Model • Fault Tolerance • Cluster Deployment • Monitoring • Scaling and Tuning • Lambda Architecture • Probabilistic Data Structures • Approximations http://effectivespark.com http://sparkinaction.com Thanks!! Chris Fregly @cfregly