- Apache Spark is an open-source cluster computing framework that provides a fast, general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs), which allow in-memory processing for speed.
- The document discusses Spark's key concepts like transformations, actions, and directed acyclic graphs (DAGs) that represent Spark job execution. It also summarizes Spark SQL, MLlib, and Spark Streaming modules.
- The presenter is a solutions architect who provides an overview of Spark and how it addresses limitations of Hadoop by enabling faster, in-memory processing using RDDs and a more intuitive API compared to MapReduce.
Apache Spark has emerged over the past year as the successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications easily, without worrying about the internal workings of how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help in understanding best practices for writing such an application. Finally, Hari will discuss how to write a custom application and a custom receiver to receive data from other systems.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17 – spark-project
Slides from Tathagata Das's talk "Deep Dive with Spark Streaming" at the Spark Meetup on June 17, 2013 in Sunnyvale, California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Unified Big Data Processing with Apache Spark – C4Media
Video and slides synchronized, mp3 and slide download available at http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ... – Tathagata Das
Spark Streaming is a framework for processing large volumes of streaming data in near-real-time. This is an introductory presentation about how Spark Streaming and Kafka can be used for high volume near-real-time streaming data processing in a cluster. This was a guest lecture in a Stanford course.
More information on the course at http://stanford.edu/~rezab/dao/
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Apache Spark in Depth: Core Concepts, Architecture & Internals – Anton Kirillov
Slides cover core Apache Spark concepts such as RDDs, DAGs, the execution workflow, forming stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark Driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing example Spark applications and a dockerized Hadoop environment to experiment with.
Some notes about Spark Streaming's positioning given the current players: Beam, Flink, Storm, et al. Helpful if you have to choose a streaming engine for your project.
Deep dive into Spark Streaming; topics include:
1. Spark Streaming Introduction
2. Computing Model in Spark Streaming
3. System Model & Architecture
4. Fault-tolerance, Checkpointing
5. Comb on Spark Streaming
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff, and can be used as an introduction to Apache Spark.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive – Sachin Aggarwal
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed DataSet). One of the key reasons why Apache Spark is so different is the introduction of the RDD. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high-level introduction to RDDs, and in the second half we will have a deep dive into RDDs.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP – Tathagata Das
This is the academic conference talk on Spark Streaming, where I introduce the concept of Discretized Streams and how it achieves large-scale, efficient, fault-tolerant streaming in a different way than traditional stream processing systems.
Survey of Spark for Data Pre-Processing and Analytics – Yannick Pouliot
A short presentation I gave on why Apache Spark is such an impressive analytics platform, particularly for R and Python users. I also discuss how academia can benefit from an Amazon AWS implementation.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
Building Robust, Adaptive Streaming Apps with Spark Streaming – Databricks
As the adoption of Spark Streaming increases rapidly, the community has been asking for greater robustness and scalability from Spark Streaming applications in a wider range of operating environments. To fulfill these demands, we have steadily added a number of features in Spark Streaming. We have added backpressure mechanisms, which allow Spark Streaming to dynamically adapt to changes in incoming data rates and maintain stability of the application. In addition, we are extending Spark’s Dynamic Allocation to Spark Streaming, so that streaming applications can elastically scale based on processing requirements. In my talk, I am going to explore these mechanisms and explain how developers can write robust, scalable and adaptive streaming applications using them. Presented by Tathagata "TD" Das from Databricks.
My talk for the Scala meetup at PayPal's Singapore office.
The intention is to focus on 3 things:
(a) two common functions in Apache Spark, "aggregate" and "cogroup"
(b) Spark SQL
(c) Spark Streaming
The umbrella event is http://www.meetup.com/Singapore-Scala-Programmers/events/219613576/
Bellevue Big Data meetup: Dive Deep into Spark Streaming – Santosh Sahoo
A discussion of the code and architecture for building a real-time streaming application using Spark and Kafka. This demo presents some use cases and patterns of different streaming frameworks.
DataStax: Spark Cassandra Connector - Past, Present and Future – DataStax Academy
It's the year 2015, and while we don't have hoverboards and self-drying jackets, we do have the next best thing: an open-source connector between Apache Spark and Cassandra. Explore the general architecture of the connector and become an expert on how Spark and Cassandra can work together in harmony. Learn how the DataStax Enterprise integration with Spark provides exciting new features like Paxos-enabled high availability for the Spark Master. Also get a sneak peek at the new and exciting features to come in the Spark Connector and the DSE integration! If you are writing a Spark application that needs access to Cassandra, this talk is for you.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
*How Spark Streaming works - a quick review.
*Features in Spark Streaming that help prevent potential data loss.
*Complementary tools in a streaming pipeline - Kafka and Akka.
*Design and tuning tips for Reactive Spark Streaming applications.
Exploring Language Classification with Spark and the Spark Notebook – Gerard Maas
In this presentation and linked notebooks we learn the basics of creating a machine learning classifier from scratch using language classification as a running example. We start by implementing the naive intuition that letter frequency could provide a model for language classification, and then we will implement the n-gram paper from Cavnar and Trenkle.
In the corresponding notebook, we will create a Spark ML Transformer from the n-gram model that can be used to classify text in a Dataset or DataFrame.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low-latency data processing architect at Yahoo. He is a PMC member on many Apache projects, including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk will cover a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
5 Ways to Use Spark to Enrich your Cassandra Environment – Jim Hatcher
Apache Cassandra is a powerful system for supporting large-scale, low-latency data systems, but it has some tradeoffs. Apache Spark can help fill those gaps, and this presentation will show you how.
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Muktadiur Rahman, Team Lead, M&H Informatics (BD) Ltd.
My presentation at Java User Group BD Meetup #5.0 (JUGBD#5.0).
Apache Spark™ is a fast and general engine for large-scale data processing. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Abstract –
Spark 2 is here! While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 for creating better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
1. Apache Spark, an Introduction
Jonathan Lacefield – Solution Architect
DataStax
2. Disclaimer
The contents of this presentation represent my personal views and do not reflect or represent any views of my employer.
This is my take on Spark.
This is not DataStax’s take on Spark.
3. Notes
• Meetup Sponsor:
– Data Exchange Platform
– Core Software Engineering – Equifax
• Announcement:
– Data Exchange Platform is currently hiring to build the next generation data platform. We are looking for people with experience in one or more of the following skills: Spark, Storm, Kafka, Samza, Hadoop, Cassandra
– How to apply? Email aravind.yarram@equifax.com
4. Introduction
• Jonathan Lacefield
– Solutions Architect, DataStax
– Former Dev, DBA, Architect, reformed PM
– Email: jlacefie@gmail.com
– Twitter: @jlacefie
– LinkedIn: www.linkedin.com/in/jlacefield
This deck represents my own views and not the views of my employer.
5. DataStax Introduction
DataStax delivers Apache Cassandra in a database platform purpose-built for the performance and availability demands of IOT, web, and mobile applications, giving enterprises a secure, always-on database that remains operationally simple when scaled in a single datacenter or across multiple datacenters and clouds.
Includes
1. Apache Cassandra
2. Apache Spark
3. Apache SOLR
4. Apache Hadoop
5. Graph Coming Soon
6. DataStax, What We Do (Use Cases)
• Fraud Detection
• Personalization
• Internet of Things
• Messaging
• Lists of Things (Products, Playlists, etc.)
• Smaller set of other things too!
We are all about working with temporal data sets at large volumes with high transaction counts (velocity).
7. Agenda
• Set Baseline (Pre-Distributed Days and Hadoop)
• Spark Conceptual Introduction
• Spark Key Concepts (Core)
• Spark Look at Each Module
– Spark SQL
– MLlib
– Spark Streaming
– GraphX
14. • Started in 2009 in Berkeley’s AMPLab
• Open sourced in 2010
• Commercial provider is Databricks – http://databricks.com
• Solves 2 big Hadoop pain points:
– Speed – in-memory and fault tolerant
– Ease of Use – API of operations and datasets
15. Use Cases for Apache Spark
• Data ETL
• Interactive dashboard creation for customers
• Streaming (e.g., fraud detection, real-time video optimization)
• “Complex analytics” (e.g., anomaly detection, trend analysis)
16. Key Concepts - Core
• Resilient Distributed Datasets (RDDs) – Spark’s datasets
• Spark Context – Provides information on the Spark environment and the application
• Transformations – Transform data
• Actions – Trigger actual processing
• Directed Acyclic Graph (DAG) – Spark’s execution algorithm
• Broadcast Variables – Read-only variables on Workers
• Accumulators – Variables that can be added to, with an associated function, on Workers
• Driver – “Main” application container for Spark execution
• Executors – Execute tasks on data
• Resource Manager – Manages task assignment and status
• Worker – Executes and caches
(A small sketch tying several of these together follows.)
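A hedged, minimal sketch (not from the deck), assuming a local master and a log.txt input file:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("CoreConcepts").setMaster("local[2]")
val sc = new SparkContext(conf)               // Spark Context, created in the Driver
val stopWords = sc.broadcast(Set("a", "the")) // Broadcast Variable: read-only on Workers
val lines = sc.textFile("log.txt")            // RDD
val words = lines.flatMap(_.split(" "))       // Transformation (lazy, builds the DAG)
  .filter(w => !stopWords.value.contains(w))  // another Transformation
println(words.count())                        // Action: Executors run tasks, result returns to the Driver
sc.stop()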
17. Resilient Distributed Datasets (RDDs)
• Fault-tolerant collection of elements that enables parallel processing
• Spark’s main abstraction
• Transformations and Actions are executed against RDDs
• Can persist in memory, on disk, or both
• Can be partitioned to control parallel processing
• Can be reused
– HUGE efficiencies with processing
18. RDDs - Resilient
[Diagram, source – databricks.com: an HDFS File flows through filter(func = someFilter(…)) to a Filtered RDD, then through map(func = someAction(...)) to a Mapped RDD]
RDDs track lineage information that can be used to efficiently recompute lost data (a sketch of this pipeline follows).
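A hedged sketch of the diagram's pipeline, with a hypothetical error-filtering predicate standing in for the slide's elided someFilter:
val hdfsFile = sc.textFile("hdfs:///logs/input.txt")
val filteredRDD = hdfsFile.filter(line => line.contains("error")) // Filtered RDD
val mappedRDD = filteredRDD.map(line => line.toUpperCase)         // Mapped RDD
// If a partition of mappedRDD is lost, Spark uses this lineage to replay
// filter and map on just the lost partition rather than recomputing everything.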
20. RDDs – From the API
val someRdd = sc.textFile(someURL)
• Create an RDD from a text file
val lines = sc.parallelize(List("pandas", "i like pandas"))
• Create an RDD from a list of elements
• Can create RDDs from many different sources
• RDDs can, and should, be persisted in most cases
– lines.persist() or lines.cache()
• See here for more info
– http://spark.apache.org/docs/1.2.0/programming-guide.html
21. Transformations
• Create one RDD and transform the contents into another RDD
• Examples
– Map
– Filter
– Union
– Distinct
– Join
• Complete list - http://spark.apache.org/docs/1.2.0/programming-guide.html
• Lazy execution – Transformations aren’t applied to an RDD until an Action is executed
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
22. Actions
• Cause data to be returned to the driver or saved to output
• Cause data retrieval and execution of all Transformations on RDDs
• Common Actions
– Reduce
– Collect
– Take
– SaveAs….
• Complete list - http://spark.apache.org/docs/1.2.0/programming-guide.html
• errorsRDD.take(1)
23. Example App
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")          # 1: create the context
    lines = sc.textFile(sys.argv[1])                 # 2: load the input file
    counts = (lines.flatMap(lambda s: s.split(" "))  # 3: split, pair, and count
              .map(lambda word: (word, 1))
              .reduceByKey(lambda x, y: x + y))
    counts.saveAsTextFile(sys.argv[2])
Based on source from databricks.com
27. Spark SQL
Abstraction of the Spark API to support SQL-like interaction
[Diagram: Spark SQL and HiveQL queries are Parsed, Analyzed into a LogicalPlan, Optimized, turned into a PhysicalPlan, and Executed; Catalyst and the SQL Core implement this pipeline]
• Programming Guide - https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
• Used for code source in examples
• Catalyst - http://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/
28. SQLContext and SchemaRDD
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
A SchemaRDD can be created:
1) Using reflection to infer the schema structure from an existing RDD
2) Using the programmatic interface to create a schema and apply it to an RDD
29. SchemaRDD Creation - Reflection
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
30. SchemaRDD Creation - Explicit
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
// Register the SchemaRDD as a table.
peopleSchemaRDD.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Name: " + t(0)).collect().foreach(println)
31. Data Frames
• DataFrames will replace SchemaRDD
• https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
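As a hedged taste of that API (Spark 1.3-era calls; people.json is the standard Spark example file):
val df = sqlContext.jsonFile("examples/src/main/resources/people.json") // schema inferred from the JSON
df.printSchema()                 // inspect the inferred columns
df.select("name").show()         // column projection
df.filter(df("age") > 21).show() // predicates are optimized through Catalyst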
42. Initializing Streaming Context
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
46. Initializing Socket Stream
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
47. Initializing Twitter Stream
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
48. Custom Receiver (WebSocket)
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val rsvp = ssc.receiverStream(new WebSocketReceiver("ws://stream.meetup.com/2/rsvps"))
import org.apache.spark.streaming.receiver.Receiver
class WebSocketReceiver(url: String) extends Receiver[String](storageLevel) {
// ...
}
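The slide elides the receiver body. Below is a hedged, minimal sketch of a line-oriented receiver built on plain TCP sockets (rather than the WebSocket client the slide implies), following the Receiver contract: onStart must not block, and store hands records to Spark for replication:
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
  def onStart(): Unit = {
    // Run the blocking read loop on its own thread so onStart returns quickly.
    new Thread("Line Receiver") { override def run(): Unit = receive() }.start()
  }
  def onStop(): Unit = {} // the reading thread exits once isStopped returns true
  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)            // hand each record to Spark for replication
        line = reader.readLine()
      }
      reader.close(); socket.close()
      restart("Connection closed, restarting receiver")
    } catch {
      case e: Exception => restart("Error receiving data", e)
    }
  }
}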
52. Multiple Streams Transformation
[Diagram: two DStreams with 1 s batches – Chars (A, B, C, D, E, …) and Digits (1, 2, 3, 4, 5, …) – merged by Chars.union(Digits) into a single DStream whose batches carry elements of both]
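A hedged sketch of the union in the diagram, with two socket sources on hypothetical ports standing in for Chars and Digits:
val chars = ssc.socketTextStream("localhost", 9191)
val digits = ssc.socketTextStream("localhost", 9192)
val merged = chars.union(digits) // single DStream; each batch contains data from both sources
merged.print()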
53. Word Count
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
55. Window Operations
• Transformations over a sliding window of data
1. Window Length – duration of the window
2. Sliding Interval – interval at which the operation is performed
[Diagram, slides 55–57: a stream of 5 s batches; a window of length 60 s covers the last twelve batches and slides forward every 10 s, so successive windows overlap]
59. Word Count by Window
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
60. Large Window Considerations
• Large windows:
1. Take longer to process
2. Require a larger batch interval for stable processing
• Hour-scale windows are not recommended
• For multi-hour aggregations, use real data stores (e.g., Cassandra)
• Spark Streaming is NOT designed to be a persistent data store
• Set spark.cleaner.ttl and spark.streaming.unpersist (be careful; sketch below)
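A hedged sketch of those two settings (the values are illustrative only, not recommendations):
val conf = new SparkConf()
  .setAppName(appName)
  .set("spark.cleaner.ttl", "3600")          // seconds; periodically clears old metadata and RDDs
  .set("spark.streaming.unpersist", "true")  // lets Spark Streaming unpersist RDDs it has processed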
63. Saving to Cassandra
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
64. Start Processing
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
ssc.start()
ssc.awaitTermination()
66. Scaling Streaming
• How to scale stream processing?
[Diagram: Kafka Producer → Spark Receiver → Spark Processor → Output]
67. Parallelism – Partitioning
• Partition the input stream (e.g., by topics)
• Each receiver can be run on a separate worker
[Diagram: Kafka Topics 1 through N each feed their own Spark Receiver, Spark Processor, and Output]
68. Parallelism – Partitioning
• Partition stream (e.g. by topics)
• Use union() to create single DStream
• Transformations applied on the unified stream
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
69. Parallelism – RePartitioning
• Explicitly repartition the input stream
• Distribute received batches across a specified number of machines
[Diagram: one Twitter Producer feeds a single Spark Receiver, whose batches are redistributed across four Spark Processors, each with its own Output]
70. Parallelism – RePartitioning
• Explicitly repartition input stream
• Distribute received batches across specified number of machines
• Use inputstream.repartition(N)
val numWorkers = 5
val twitterStream = TwitterUtils.createStream(...)
val repartitioned = twitterStream.repartition(numWorkers) // repartition returns a new DStream; use it downstream
71. Parallelism – Tasks
• Each block is processed by a separate task
• To increase parallel tasks, increase the number of blocks in a batch
• Tasks per Receiver per Batch ≈ Batch Interval / Block Interval
• Example: 2 s batch / 200 ms block = 10 tasks
• CPU cores will not be fully utilized if the number of tasks is too low
• Consider tuning the default number of parallel tasks: spark.default.parallelism
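A hedged sketch of the knobs behind that arithmetic (illustrative values; the block interval defaults to 200 ms):
val conf = new SparkConf()
  .setAppName(appName)
  .set("spark.streaming.blockInterval", "200") // ms in Spark 1.x; 2 s batch / 200 ms block ≈ 10 tasks per receiver per batch
  .set("spark.default.parallelism", "32")      // default task count for shuffle operations like reduceByKey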
74. Fault Tolerance – RDD
• Recomputing an RDD may be impossible if the stream source no longer has the data
• Protect data by replicating the RDD
• RDD replication is controlled by org.apache.spark.storage.StorageLevel
• Use a storage level with the _2 suffix (2 replicas):
– DISK_ONLY_2
– MEMORY_ONLY_2
– MEMORY_ONLY_SER_2
– MEMORY_AND_DISK_2
– MEMORY_AND_DISK_SER_2 (default for most receivers)
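A hedged sketch of passing a replicated storage level to a receiver; socketTextStream accepts a StorageLevel as its third argument:
import org.apache.spark.storage.StorageLevel
val text = ssc.socketTextStream("localhost", 9191, StorageLevel.MEMORY_AND_DISK_SER_2) // each received block is stored on 2 nodes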
75. Fault Tolerance – Checkpointing
• Periodically writes:
1. DAG/metadata of DStream(s)
2. RDD data for some stateful transformations (updateStateByKey & reduceByKeyAndWindow*)
• Uses a fault-tolerant distributed file system for persistence.
• After failure, the StreamingContext is recreated from checkpoint data on restart.
• Choose the interval carefully, as storage will impact processing times.
76. Fault Tolerance – Checkpointing
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
val checkpointDirectory = "words.cp" // Directory name for checkpoint data
val conf = new SparkConf().setAppName(appName).setMaster(master)
def createContext(): StreamingContext = {
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
ssc.start()
ssc.awaitTermination()
79. State of Data
1. Data received and replicated
• Will survive failure of 1 replica
2. Data received but only buffered for replication
• Not replicated yet
• Needs recomputation if lost
80. Receiver Reliability Types
1. Reliable Receivers
• The receiver acknowledges the source only after ensuring that the data is replicated.
• The source needs to support message acks, e.g., Kafka, Flume.
2. Unreliable Receivers
• Data can be lost in case of failure.
• The source doesn’t support message acks, e.g., Twitter.
81. Fault Tolerance
• Spark 1.2 adds Write Ahead Log (WAL) support for Streaming
• Protection for Unreliable Receivers
• See SPARK-3129 for architecture details

State / Receiver Type    Received, Replicated    Received, Only Buffered
Reliable Receiver        Safe                    Safe
Unreliable Receiver      Safe                    Data Loss
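A hedged sketch of turning the WAL on (this is the configuration key that shipped with Spark 1.2; checkpointing must also be enabled so the log has a fault-tolerant home):
val conf = new SparkConf()
  .setAppName(appName)
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // persist received data before processing
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///checkpoints/words") // WAL files are written under the checkpoint directory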
82. GraphX
• Alpha release
• Provides graph computation capabilities on top of RDDs
• Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
• The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.
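A hedged, minimal sketch of building a property graph with the standard GraphX constructors (the vertices and edges here are made up):
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)                // directed multigraph with properties on vertices and edges
println(graph.inDegrees.collect().mkString(", ")) // data-parallel operations over graph views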
83. I am not a Graph-guy yet.
Who here is working with Graph today?
84. Handy Tools
• Ooyala Spark Job Server - https://github.com/ooyala/spark-jobserver
• Monitoring with Graphite and Grafana – http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/